T5 Russian Spell
A Russian spelling correction model for text derived from audio recognition; it can be used in conjunction with wav2vec2 audio recognition models.
Downloads 1,910
Release Time : 3/29/2022
Model Overview
This model is primarily used to correct text output from Russian speech recognition systems (such as wav2vec2), improving the accuracy of recognition results. It has been tested and validated on random YouTube videos.
Model Features
Audio recognition text correction
Specifically designed for spelling and grammar correction of Russian text output from speech recognition systems
Integration with wav2vec2
Can be seamlessly used with the UrukHan/wav2vec2-russian speech recognition model
Context-aware correction
Capable of understanding context for more accurate spelling and grammar corrections
Model Capabilities
Russian text spelling correction
Grammar error correction
Speech recognition post-processing
Text normalization
Use Cases
Speech recognition enhancement
Speech recognition result correction
Automatically corrects Russian text output from speech recognition systems
Significantly improves the accuracy and readability of speech recognition text
Content processing
Automatic subtitle correction
Corrects spelling and grammar errors in automatically generated video subtitles
Improves subtitle quality and viewing experience
🚀 t5-russian-spell
A model for correcting text from recognized audio. You can feed the results of my audio recognition model into this model.
🚀 Quick Start
This model, t5-russian-spell, is designed to correct text from recognized audio. You can use the results of the audio recognition model UrukHan/wav2vec2-russian as input for this model.
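As an illustration only (this snippet is not in the original card), the two models can be chained end to end; "audio.wav" is a placeholder path and the "Spell correct: " prefix matches the usage example further below:

from transformers import pipeline, AutoModelForSeq2SeqLM, T5TokenizerFast

# Transcribe audio with the wav2vec2 model ("audio.wav" is a placeholder path)
asr = pipeline("automatic-speech-recognition", model="UrukHan/wav2vec2-russian")
raw_text = asr("audio.wav")["text"]

# Correct the raw transcript with t5-russian-spell
tokenizer = T5TokenizerFast.from_pretrained("UrukHan/t5-russian-spell")
model = AutoModelForSeq2SeqLM.from_pretrained("UrukHan/t5-russian-spell")
inputs = tokenizer("Spell correct: " + raw_text, return_tensors="pt")
ids = model.generate(**inputs, max_length=256)
print(tokenizer.decode(ids[0], skip_special_tokens=True))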
Example Comparison
| Output wav2vec2 | Output spell corrector |
|---|---|
| ывсем привет выныканалетоп армии и это двадцать пятый день спец операций на украине ет самый главной новости российские военные ракетами кинжалы калибр уничтожили крупную военную топливную базу украины ракетным ударом по населенному пункту под жетамиром уничтжены более стаукраинских военных в две тысячи двадцать втором году | Всем привет! Вы в курсе новостей от армии. И это 25 день спецопераций на Украине. Есть самые главные новости. Российские военные ракетами «Кинжалы» и «Кинжалы» калибра уничтожили крупную военную топливную базу Украины. Ракетным ударом по населенному пункту под Жетамиром уничтожены более ста украинских военных в 2022г. |
Datasets for Training
The training example below uses the UrukHan/t5-russian-spell_I dataset.
💻 Usage Examples
Basic Usage
# Install the transformers library
!pip install transformers
# Import libraries
from transformers import AutoModelForSeq2SeqLM, T5TokenizerFast
# Set the name of the selected model from the hub
MODEL_NAME = 'UrukHan/t5-russian-spell'
MAX_INPUT = 256
# Load the model and tokenizer
tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
# Input data (can be an array of phrases or text)
input_sequences = ['сеглдыя хорош ден', 'когд а вы прдет к нам в госи']  # or a single phrase: input_sequences = 'сеглдыя хорош ден'
task_prefix = "Spell correct: "
if not isinstance(input_sequences, list):
    input_sequences = [input_sequences]
# Tokenize the data
encoded = tokenizer(
    [task_prefix + sequence for sequence in input_sequences],
    padding="longest",
    max_length=MAX_INPUT,
    truncation=True,
    return_tensors="pt",
)
# Generate predictions
predicts = model.generate(**encoded)
# Decode the predictions into corrected strings
corrected = tokenizer.batch_decode(predicts, skip_special_tokens=True)
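batch_decode returns a plain Python list with one corrected string per input phrase. As an illustrative follow-up (not part of the original card), you can print the input and output pairs side by side:

for source, fixed in zip(input_sequences, corrected):
    print(source, '->', fixed)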
Advanced Usage
You can use the following Colab notebook to run the training and save the model to your own repository on the Hugging Face Hub: Colab Notebook
# Install libraries
!pip install datasets
!apt install git-lfs
!pip install transformers
!pip install sentencepiece
!pip install rouge_score
# Import libraries
import numpy as np
from datasets import Dataset
import tensorflow as tf  # not used directly below, kept from the original notebook
import nltk
from transformers import T5TokenizerFast, Seq2SeqTrainingArguments, Seq2SeqTrainer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
import torch
from transformers.optimization import Adafactor, AdafactorSchedule
from datasets import load_dataset, load_metric
# Load the ROUGE metric and NLTK sentence tokenizer (the xsum dataset loaded here is not used further in this script)
raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")
nltk.download('punkt')
# Enter your Hugging Face Hub key
from huggingface_hub import notebook_login
notebook_login()
# Define parameters
REPO = "t5-russian-spell" # Enter the name of your repository
MODEL_NAME = "UrukHan/t5-russian-spell" # Enter the name of the selected model from the hub
MAX_INPUT = 256 # Maximum input length in tokens (as a rule of thumb, one token is roughly half a word)
MAX_OUTPUT = 256 # Maximum prediction length in tokens (can be reduced for summarization or other tasks with shorter outputs)
BATCH_SIZE = 8
DATASET = 'UrukHan/t5-russian-spell_I' # Enter the name of the dataset
# Load the dataset. I will describe the use of other types of data below
data = load_dataset(DATASET)
# Load the model and tokenizer
tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
model.config.max_length = MAX_OUTPUT # The default is 20, which would truncate the generated output
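# (Illustrative addition, not in the original card) MAX_INPUT and MAX_OUTPUT are
# measured in tokenizer tokens, not words; you can check how many tokens a phrase
# actually consumes like this:
print(len(tokenizer("Spell correct: сеглдыя хорош ден").input_ids))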
# Comment this out after the first push to the repository; it only needs to run once
tokenizer.push_to_hub(REPO)
train = data['train']
test = data['test'].train_test_split(0.02)['test'] # I reduced the test set so that I don't have to wait too long for the error calculation between epochs
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model) #return_tensors="tf"
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    return {k: round(v, 4) for k, v in result.items()}
training_args = Seq2SeqTrainingArguments(
    output_dir=REPO,
    #overwrite_output_dir=True,
    evaluation_strategy='steps',
    #learning_rate=2e-5,
    eval_steps=5000,
    save_steps=5000,
    num_train_epochs=1,
    predict_with_generate=True,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    fp16=True,
    save_total_limit=2,
    #generation_max_length=256,
    #generation_num_beams=4,
    weight_decay=0.005,
    #logging_dir='logs',
    push_to_hub=True,
)
# Manually select the optimizer. The original architecture of T5 uses the Adafactor optimizer
optimizer = Adafactor(
    model.parameters(),
    lr=1e-5,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False,
)
lr_scheduler = AdafactorSchedule(optimizer)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=test,
    data_collator=data_collator,  # pass the collator created above so labels are padded correctly
    optimizers=(optimizer, lr_scheduler),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.push_to_hub()
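After the push succeeds, the fine-tuned model can be loaded back from your own repository for inference. This is a sketch (not part of the original card); the repository name is a placeholder:

from transformers import AutoModelForSeq2SeqLM, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("your-username/t5-russian-spell")  # placeholder repo id
model = AutoModelForSeq2SeqLM.from_pretrained("your-username/t5-russian-spell")  # placeholder repo id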
Example of Converting Arrays for this Network
input_data = ['удач почти отнее отвернулась', 'в хааоде проведения чемпиониавта мира дветысячивосемнандцтая лгодаа']
output_data = ['Удача почти от нее отвернулась', 'в ходе проведения чемпионата мира две тысячи восемнадцатого года']
# Tokenize the input data
task_prefix = "Spell correct: "
input_sequences = input_data
encoding = tokenizer(
    [task_prefix + sequence for sequence in input_sequences],
    padding="longest",
    max_length=MAX_INPUT,
    truncation=True,
    return_tensors="pt",
)
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask
# Tokenize the output data
target_encoding = tokenizer(output_data, padding="longest", max_length=MAX_OUTPUT, truncation=True)
labels = target_encoding.input_ids
# replace padding token id's of the labels by -100
labels = torch.tensor(labels)
labels[labels == tokenizer.pad_token_id] = -100
# Convert our data to the Dataset format
import pandas as pd  # pandas is needed for the DataFrame below

data = Dataset.from_pandas(pd.DataFrame({
    'input_ids': list(np.array(input_ids)),
    'attention_mask': list(np.array(attention_mask)),
    'labels': list(np.array(labels)),
}))
data = data.train_test_split(0.02)
# This gives the inputs for the trainer: train_dataset = data['train'], eval_dataset = data['test']
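If you want to reuse the converted dataset across sessions, it can be saved locally or pushed to the Hub. This is a sketch (not part of the original card); the repository name is a placeholder:

data.save_to_disk("t5-russian-spell-data")  # save the DatasetDict locally
# or, after notebook_login():
data.push_to_hub("your-username/t5-russian-spell-data")  # placeholder repo id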