T5 Russian Spell
A Russian spelling correction model for text derived from audio recognition; it can be used in conjunction with wav2vec2 audio recognition models.
Downloads 1,910
Release Time : 3/29/2022
Model Overview
This model is primarily used to correct text output from Russian speech recognition systems (such as wav2vec2), improving the accuracy of recognition results. It has been tested and validated on random YouTube videos.
Model Features
Audio recognition text correction
Specifically designed for spelling and grammar correction of Russian text output from speech recognition systems
Integration with wav2vec2
Can be seamlessly used with the UrukHan/wav2vec2-russian speech recognition model
Context-aware correction
Capable of understanding context for more accurate spelling and grammar corrections
Model Capabilities
Russian text spelling correction
Grammar error correction
Speech recognition post-processing
Text normalization
Use Cases
Speech recognition enhancement
Speech recognition result correction
Automatically corrects Russian text output from speech recognition systems
Significantly improves the accuracy and readability of speech recognition text
Content processing
Automatic subtitle correction
Corrects spelling and grammar errors in automatically generated video subtitles
Improves subtitle quality and viewing experience
🚀 t5-russian-spell
A model for correcting text from recognized audio. You can feed the results of my audio recognition model into this model.
🚀 Quick Start
This model, t5-russian-spell, is designed to correct text from recognized audio. You can use the results of the audio recognition model UrukHan/wav2vec2-russian as input for this model.
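As an illustration only (this snippet is not in the original card), the two models can be chained end to end; "audio.wav" is a placeholder path and the "Spell correct: " prefix matches the usage example further below:

from transformers import pipeline, AutoModelForSeq2SeqLM, T5TokenizerFast

# Transcribe audio with the wav2vec2 model ("audio.wav" is a placeholder path)
asr = pipeline("automatic-speech-recognition", model="UrukHan/wav2vec2-russian")
raw_text = asr("audio.wav")["text"]

# Correct the raw transcript with t5-russian-spell
tokenizer = T5TokenizerFast.from_pretrained("UrukHan/t5-russian-spell")
model = AutoModelForSeq2SeqLM.from_pretrained("UrukHan/t5-russian-spell")
inputs = tokenizer("Spell correct: " + raw_text, return_tensors="pt")
ids = model.generate(**inputs, max_length=256)
print(tokenizer.decode(ids[0], skip_special_tokens=True))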
Example Comparison
| Output wav2vec2 | Output spell corrector |
|---|---|
| ывсем привет выныканалетоп армии и это двадцать пятый день спец операций на украине ет самый главной новости российские военные ракетами кинжалы калибр уничтожили крупную военную топливную базу украины ракетным ударом по населенному пункту под жетамиром уничтжены более стаукраинских военных в две тысячи двадцать втором году | Всем привет! Вы в курсе новостей от армии. И это 25 день спецопераций на Украине. Есть самые главные новости. Российские военные ракетами «Кинжалы» и «Кинжалы» калибра уничтожили крупную военную топливную базу Украины. Ракетным ударом по населенному пункту под Жетамиром уничтожены более ста украинских военных в 2022г. |
Datasets for Training
The training example below uses the UrukHan/t5-russian-spell_I dataset.
💻 Usage Examples
Basic Usage
# Install the transformers library
!pip install transformers
# Import libraries
from transformers import AutoModelForSeq2SeqLM, T5TokenizerFast
# Set the name of the selected model from the hub
MODEL_NAME = 'UrukHan/t5-russian-spell'
MAX_INPUT = 256
# Load the model and tokenizer
tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
# Input data (can be an array of phrases or text)
input_sequences = ['сеглдыя хорош ден', 'когд а вы прдет к нам в госи']  # or a single phrase: input_sequences = 'сеглдыя хорош ден'
task_prefix = "Spell correct: "
if not isinstance(input_sequences, list):
    input_sequences = [input_sequences]
# Tokenize the data
encoded = tokenizer(
    [task_prefix + sequence for sequence in input_sequences],
    padding="longest",
    max_length=MAX_INPUT,
    truncation=True,
    return_tensors="pt",
)
# Generate predictions
predicts = model.generate(**encoded)
# Decode the predictions into corrected strings
corrected = tokenizer.batch_decode(predicts, skip_special_tokens=True)
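batch_decode returns a plain Python list with one corrected string per input phrase. As an illustrative follow-up (not part of the original card), you can print the input and output pairs side by side:

for source, fixed in zip(input_sequences, corrected):
    print(source, '->', fixed)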
Advanced Usage
You can use the following Colab notebook to run the training and save the model to your own repository on the Hugging Face Hub: Colab Notebook
# Install libraries
!pip install datasets
!apt install git-lfs
!pip install transformers
!pip install sentencepiece
!pip install rouge_score
# Import libraries
import numpy as np
from datasets import Dataset
import tensorflow as tf  # not used directly below, kept from the original notebook
import nltk
from transformers import T5TokenizerFast, Seq2SeqTrainingArguments, Seq2SeqTrainer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
import torch
from transformers.optimization import Adafactor, AdafactorSchedule
from datasets import load_dataset, load_metric
# Load the ROUGE metric and NLTK sentence tokenizer (the xsum dataset loaded here is not used further in this script)
raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")
nltk.download('punkt')
# Enter your Hugging Face Hub key
from huggingface_hub import notebook_login
notebook_login()
# Define parameters
REPO = "t5-russian-spell" # Enter the name of your repository
MODEL_NAME = "UrukHan/t5-russian-spell" # Enter the name of the selected model from the hub
MAX_INPUT = 256 # Maximum input length in tokens (as a rule of thumb, one token is roughly half a word)
MAX_OUTPUT = 256 # Maximum prediction length in tokens (can be reduced for summarization or other tasks with shorter outputs)
BATCH_SIZE = 8
DATASET = 'UrukHan/t5-russian-spell_I' # Enter the name of the dataset
# Load the dataset. I will describe the use of other types of data below
data = load_dataset(DATASET)
# Load the model and tokenizer
tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
model.config.max_length = MAX_OUTPUT # The default is 20, which would truncate the generated output
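# (Illustrative addition, not in the original card) MAX_INPUT and MAX_OUTPUT are
# measured in tokenizer tokens, not words; you can check how many tokens a phrase
# actually consumes like this:
print(len(tokenizer("Spell correct: сеглдыя хорош ден").input_ids))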
# Comment this out after the first push to the repository; it only needs to run once
tokenizer.push_to_hub(REPO)
train = data['train']
test = data['test'].train_test_split(0.02)['test'] # I reduced the test set so that I don't have to wait too long for the error calculation between epochs
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model) #return_tensors="tf"
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    return {k: round(v, 4) for k, v in result.items()}
training_args = Seq2SeqTrainingArguments(
    output_dir=REPO,
    #overwrite_output_dir=True,
    evaluation_strategy='steps',
    #learning_rate=2e-5,
    eval_steps=5000,
    save_steps=5000,
    num_train_epochs=1,
    predict_with_generate=True,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    fp16=True,
    save_total_limit=2,
    #generation_max_length=256,
    #generation_num_beams=4,
    weight_decay=0.005,
    #logging_dir='logs',
    push_to_hub=True,
)
# Manually select the optimizer. The original architecture of T5 uses the Adafactor optimizer
optimizer = Adafactor(
    model.parameters(),
    lr=1e-5,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False,
)
lr_scheduler = AdafactorSchedule(optimizer)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=test,
    data_collator=data_collator,  # pass the collator created above so labels are padded correctly
    optimizers=(optimizer, lr_scheduler),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.push_to_hub()
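After the push succeeds, the fine-tuned model can be loaded back from your own repository for inference. This is a sketch (not part of the original card); the repository name is a placeholder:

from transformers import AutoModelForSeq2SeqLM, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("your-username/t5-russian-spell")  # placeholder repo id
model = AutoModelForSeq2SeqLM.from_pretrained("your-username/t5-russian-spell")  # placeholder repo id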
Example of Converting Arrays for this Network
input_data = ['удач почти отнее отвернулась', 'в хааоде проведения чемпиониавта мира дветысячивосемнандцтая лгодаа']
output_data = ['Удача почти от нее отвернулась', 'в ходе проведения чемпионата мира две тысячи восемнадцатого года']
# Tokenize the input data
task_prefix = "Spell correct: "
input_sequences = input_data
encoding = tokenizer(
    [task_prefix + sequence for sequence in input_sequences],
    padding="longest",
    max_length=MAX_INPUT,
    truncation=True,
    return_tensors="pt",
)
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask
# Tokenize the output data
target_encoding = tokenizer(output_data, padding="longest", max_length=MAX_OUTPUT, truncation=True)
labels = target_encoding.input_ids
# replace padding token id's of the labels by -100
labels = torch.tensor(labels)
labels[labels == tokenizer.pad_token_id] = -100
# Convert our data to the Dataset format
import pandas as pd  # pandas is needed for the DataFrame below

data = Dataset.from_pandas(pd.DataFrame({
    'input_ids': list(np.array(input_ids)),
    'attention_mask': list(np.array(attention_mask)),
    'labels': list(np.array(labels)),
}))
data = data.train_test_split(0.02)
# This gives the inputs for the trainer: train_dataset = data['train'], eval_dataset = data['test']
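If you want to reuse the converted dataset across sessions, it can be saved locally or pushed to the Hub. This is a sketch (not part of the original card); the repository name is a placeholder:

data.save_to_disk("t5-russian-spell-data")  # save the DatasetDict locally
# or, after notebook_login():
data.push_to_hub("your-username/t5-russian-spell-data")  # placeholder repo id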