T5 Russian Summarization
A T5 model for correcting audio transcriptions, handling both Russian text summarization and spelling correction
Downloads 829
Release Time : 4/2/2022
Model Overview
This model is primarily used for generating summaries and correcting spelling in Russian audio transcriptions, and is designed to work in conjunction with the wav2vec2-russian speech recognition model.
Model Features
Russian text processing
Specialized summarization and correction capabilities optimized for Russian text
Integration with audio models
Can be paired with the wav2vec2-russian speech recognition model to form a complete audio processing pipeline
Multi-task processing
Simultaneously supports both text summarization and spelling correction functionalities
Model Capabilities
Russian text summarization
Spelling correction
Post-processing of audio transcriptions
Use Cases
Media content processing
News summarization
Automatically generates concise summaries from Russian news audio transcriptions
As shown in the example below, it can compress lengthy news text into a brief summary
Speech recognition post-processing
Audio transcription correction
Corrects errors in Russian text output from speech recognition models
🚀 t5-russian-summarization
This model corrects and summarizes text recognized from audio. You can use the output of my speech recognition model UrukHan/wav2vec2-russian as input for this model; I've tested it on random YouTube videos. A minimal pipeline sketch is shown below.
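The following is a minimal sketch of the intended two-stage pipeline: transcribe audio with UrukHan/wav2vec2-russian, then correct and summarize the transcript with this model. The audio file name and the use of the generic automatic-speech-recognition pipeline are illustrative assumptions, not part of the original card.
# Minimal two-stage sketch (assumes a local file 'example_audio.wav' exists and that
# UrukHan/wav2vec2-russian loads with the standard ASR pipeline)
from transformers import pipeline, AutoModelForSeq2SeqLM, T5TokenizerFast

asr = pipeline("automatic-speech-recognition", model="UrukHan/wav2vec2-russian")
transcript = asr("example_audio.wav")["text"]  # hypothetical input file

tokenizer = T5TokenizerFast.from_pretrained("UrukHan/t5-russian-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("UrukHan/t5-russian-summarization")
encoded = tokenizer("Spell correct: " + transcript, max_length=256, truncation=True, return_tensors="pt")
print(tokenizer.batch_decode(model.generate(**encoded), skip_special_tokens=True)[0])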
🚀 Quick Start
Input and Output Example
| Input | Output |
| --- | --- |
| After the start of Russia's special military operation to demilitarize Ukraine, the West imposed several rounds of new economic sanctions. The Kremlin called the new restrictions serious but noted that Russia had prepared for them in advance. | The West imposed new sanctions against Russia |
Datasets for Training
- UrukHan/t5-russian-summarization: https://huggingface.co/datasets/UrukHan/t5-russian-summarization
Example of Running the Model with Comments in Colab
https://colab.research.google.com/drive/1ame2va9_NflYqy4RZ07HYmQ0moJYy7w2?usp=sharing
# Install the transformers library
!pip install transformers
# Import libraries
from transformers import AutoModelForSeq2SeqLM, T5TokenizerFast
# Set the name of the selected model from the hub
MODEL_NAME = 'UrukHan/t5-russian-summarization'
MAX_INPUT = 256
# Load the model and tokenizer
tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
# Input data (can be an array of phrases or text)
input_sequences = ['After the start of Russia\'s special military operation to demilitarize Ukraine, the West imposed several rounds of new economic sanctions. The Kremlin called the new restrictions serious but noted that Russia had prepared for them in advance.'] # Or you can use a single phrase: input_sequences = 'Today is a good day'
task_prefix = "Spell correct: " # Tokenize the data
if type(input_sequences) != list:
input_sequences = [input_sequences]
encoded = tokenizer(
[task_prefix + sequence for sequence in input_sequences],
padding="longest",
max_length=MAX_INPUT,
truncation=True,
return_tensors="pt",
)
predicts = model.generate(**encoded)  # Make predictions
print(tokenizer.batch_decode(predicts, skip_special_tokens=True))  # Decode the predictions
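Generation can also be tuned with explicit parameters; the values below are illustrative assumptions rather than settings from the original card.
# Optional: tuned generation (illustrative values, not from the original card)
predicts = model.generate(**encoded, max_length=64, num_beams=4, early_stopping=True)
print(tokenizer.batch_decode(predicts, skip_special_tokens=True))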
Notebook for Training and Saving the Model
https://colab.research.google.com/drive/1H4IoasDqa2TEjGivVDp-4Pdpm0oxrCWd?usp=sharing
# Install libraries
!pip install datasets
!apt install git-lfs
!pip install transformers
!pip install sentencepiece
!pip install rouge_score
# Import libraries
import numpy as np
from datasets import Dataset
import tensorflow as tf
import nltk
from transformers import T5TokenizerFast, Seq2SeqTrainingArguments, Seq2SeqTrainer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
import torch
from transformers.optimization import Adafactor, AdafactorSchedule
from datasets import load_dataset, load_metric
# Load parameters
raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")
nltk.download('punkt')
# Enter your Hugging Face Hub key
from huggingface_hub import notebook_login
notebook_login()
# Define parameters
REPO = "t5-russian-summarization" # Enter the name of the repository
MODEL_NAME = "UrukHan/t5-russian-summarization" # Enter the name of the selected model from the hub
MAX_INPUT = 256 # Enter the maximum length of input data in tokens (roughly, count half a word as one token)
MAX_OUTPUT = 64 # Enter the maximum length of predictions in tokens (can be reduced for summarization or other tasks with short outputs)
BATCH_SIZE = 8
DATASET = 'UrukHan/t5-russian-summarization' # Enter the name of the dataset
# Load the dataset. I'll describe the use of other data types below
data = load_dataset(DATASET)
# Load the model and tokenizer
tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
model.config.max_length = MAX_OUTPUT # The default is 20, so generated sequences would otherwise be truncated
# Pushing the tokenizer is optional; comment this out after the first save to the repository
tokenizer.push_to_hub(REPO)
train = data['train']
test = data['test'].train_test_split(0.02)['test'] # I reduced the test set so as not to wait too long for the error calculation between epochs
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model) # return_tensors="tf"
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # ROUGE expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract the mid F-measure scores
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    return {k: round(v, 4) for k, v in result.items()}
training_args = Seq2SeqTrainingArguments(
    output_dir=REPO,
    # overwrite_output_dir=True,
    evaluation_strategy='steps',
    # learning_rate=2e-5,
    eval_steps=5000,
    save_steps=5000,
    num_train_epochs=1,
    predict_with_generate=True,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    fp16=True,
    save_total_limit=2,
    # generation_max_length=256,
    # generation_num_beams=4,
    weight_decay=0.005,
    # logging_dir='logs',
    push_to_hub=True,
)
# Manually select the optimizer. The original T5 architecture uses the Adafactor optimizer
optimizer = Adafactor(
    model.parameters(),
    lr=1e-5,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False,
)
lr_scheduler = AdafactorSchedule(optimizer)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=test,
    optimizers=(optimizer, lr_scheduler),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.push_to_hub()
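After training, a quick sanity check can be run with the freshly fine-tuned model. This is a minimal sketch; the input sentence is a placeholder, not an example from the original card.
# Quick post-training check (the input sentence is an illustrative placeholder)
sample = "Spell correct: " + "пример длинного новостного текста для проверки модели"
enc = tokenizer(sample, max_length=MAX_INPUT, truncation=True, return_tensors="pt").to(model.device)
print(tokenizer.batch_decode(model.generate(**enc), skip_special_tokens=True))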
Example of Converting Arrays for the Network
input_data = ['After the start of Russia\'s special military operation to demilitarize Ukraine, the West imposed several rounds of new economic sanctions. The Kremlin called the new restrictions serious but noted that Russia had prepared for them in advance.']
output_data = ['The West imposed new sanctions against Russia']
# Tokenize the input data
task_prefix = "Spell correct: "
input_sequences = input_data
encoding = tokenizer(
    [task_prefix + sequence for sequence in input_sequences],
    padding="longest",
    max_length=MAX_INPUT,
    truncation=True,
    return_tensors="pt",
)
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask
# Tokenize the output data
target_encoding = tokenizer(output_data, padding="longest", max_length=MAX_OUTPUT, truncation=True)
labels = target_encoding.input_ids
# Replace padding token id's of the labels by -100
labels = torch.tensor(labels)
labels[labels == tokenizer.pad_token_id] = -100
# Convert our data to the dataset format (pandas is needed here)
import pandas as pd
data = Dataset.from_pandas(pd.DataFrame({'input_ids': list(np.array(input_ids)), 'attention_mask': list(np.array(attention_mask)), 'labels': list(np.array(labels))}))
data = data.train_test_split(0.02)
# And we'll get the input for our trainer: train_dataset = data['train'], eval_dataset = data['test']
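As a minimal sketch of that last step, the converted splits can be plugged into the trainer constructed earlier (reusing the same arguments and optimizer defined above):
# Minimal sketch: reuse the trainer setup from above with the converted splits
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=data['train'],
    eval_dataset=data['test'],
    optimizers=(optimizer, lr_scheduler),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()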