T5 Russian Summarization
A T5 model for correcting audio transcriptions, handling both Russian text summarization and spelling correction
Downloads 829
Release Time : 4/2/2022
Model Overview
This model is primarily used for generating summaries and correcting spelling in Russian audio transcriptions, and is designed to work in conjunction with the wav2vec2-russian speech recognition model.
Model Features
Russian text processing
Specialized summarization and correction capabilities optimized for Russian text
Integration with audio models
Can be paired with the wav2vec2-russian speech recognition model to form a complete audio processing pipeline
Multi-task processing
Simultaneously supports both text summarization and spelling correction functionalities
Model Capabilities
Russian text summarization
Spelling correction
Post-processing of audio transcriptions
Use Cases
Media content processing
News summarization
Automatically generates concise summaries from Russian news audio transcriptions
As shown in the example below, it can compress lengthy news text into a brief summary
Speech recognition post-processing
Audio transcription correction
Corrects errors in Russian text output from speech recognition models
🚀 t5-russian-summarization
This model corrects and summarizes text recognized from audio. You can use the output of my speech recognition model UrukHan/wav2vec2-russian as input for this model; I've tested it on random YouTube videos. A minimal pipeline sketch is shown below.
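The following is a minimal sketch of the intended two-stage pipeline: transcribe audio with UrukHan/wav2vec2-russian, then correct and summarize the transcript with this model. The audio file name and the use of the generic automatic-speech-recognition pipeline are illustrative assumptions, not part of the original card.
# Minimal two-stage sketch (assumes a local file 'example_audio.wav' exists and that
# UrukHan/wav2vec2-russian loads with the standard ASR pipeline)
from transformers import pipeline, AutoModelForSeq2SeqLM, T5TokenizerFast

asr = pipeline("automatic-speech-recognition", model="UrukHan/wav2vec2-russian")
transcript = asr("example_audio.wav")["text"]  # hypothetical input file

tokenizer = T5TokenizerFast.from_pretrained("UrukHan/t5-russian-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("UrukHan/t5-russian-summarization")
encoded = tokenizer("Spell correct: " + transcript, max_length=256, truncation=True, return_tensors="pt")
print(tokenizer.batch_decode(model.generate(**encoded), skip_special_tokens=True)[0])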
🚀 Quick Start
Input and Output Example
| Input | Output |
| --- | --- |
| After the start of Russia's special military operation to demilitarize Ukraine, the West imposed several rounds of new economic sanctions. The Kremlin called the new restrictions serious but noted that Russia had prepared for them in advance. | The West imposed new sanctions against Russia |
Datasets for Training
- UrukHan/t5-russian-summarization: https://huggingface.co/datasets/UrukHan/t5-russian-summarization
Example of Running the Model with Comments in Colab
https://colab.research.google.com/drive/1ame2va9_NflYqy4RZ07HYmQ0moJYy7w2?usp=sharing
# Install the transformers library
!pip install transformers
# Import libraries
from transformers import AutoModelForSeq2SeqLM, T5TokenizerFast
# Set the name of the selected model from the hub
MODEL_NAME = 'UrukHan/t5-russian-summarization'
MAX_INPUT = 256
# Load the model and tokenizer
tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
# Input data (can be an array of phrases or text)
input_sequences = ['After the start of Russia\'s special military operation to demilitarize Ukraine, the West imposed several rounds of new economic sanctions. The Kremlin called the new restrictions serious but noted that Russia had prepared for them in advance.'] # Or you can use a single phrase: input_sequences = 'Today is a good day'
task_prefix = "Spell correct: " # Tokenize the data
if type(input_sequences) != list:
input_sequences = [input_sequences]
encoded = tokenizer(
[task_prefix + sequence for sequence in input_sequences],
padding="longest",
max_length=MAX_INPUT,
truncation=True,
return_tensors="pt",
)
predicts = model.generate(**encoded)  # Make predictions
print(tokenizer.batch_decode(predicts, skip_special_tokens=True))  # Decode the predictions
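Generation can also be tuned with explicit parameters; the values below are illustrative assumptions rather than settings from the original card.
# Optional: tuned generation (illustrative values, not from the original card)
predicts = model.generate(**encoded, max_length=64, num_beams=4, early_stopping=True)
print(tokenizer.batch_decode(predicts, skip_special_tokens=True))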
Notebook for Training and Saving the Model
https://colab.research.google.com/drive/1H4IoasDqa2TEjGivVDp-4Pdpm0oxrCWd?usp=sharing
# Install libraries
!pip install datasets
!apt install git-lfs
!pip install transformers
!pip install sentencepiece
!pip install rouge_score
# Import libraries
import numpy as np
from datasets import Dataset
import tensorflow as tf
import nltk
from transformers import T5TokenizerFast, Seq2SeqTrainingArguments, Seq2SeqTrainer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
import torch
from transformers.optimization import Adafactor, AdafactorSchedule
from datasets import load_dataset, load_metric
# Load parameters
raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")
nltk.download('punkt')
# Enter your Hugging Face Hub key
from huggingface_hub import notebook_login
notebook_login()
# Define parameters
REPO = "t5-russian-summarization" # Enter the name of the repository
MODEL_NAME = "UrukHan/t5-russian-summarization" # Enter the name of the selected model from the hub
MAX_INPUT = 256 # Enter the maximum length of input data in tokens (roughly, count half a word as one token)
MAX_OUTPUT = 64 # Enter the maximum length of predictions in tokens (can be reduced for summarization or other tasks with short outputs)
BATCH_SIZE = 8
DATASET = 'UrukHan/t5-russian-summarization' # Enter the name of the dataset
# Load the dataset. I'll describe the use of other data types below
data = load_dataset(DATASET)
# Load the model and tokenizer
tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
model.config.max_length = MAX_OUTPUT # The default is 20, so generated sequences would otherwise be truncated
# Pushing the tokenizer is optional; comment this out after the first save to the repository
tokenizer.push_to_hub(REPO)
train = data['train']
test = data['test'].train_test_split(0.02)['test'] # I reduced the test set so as not to wait too long for the error calculation between epochs
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model) # return_tensors="tf"
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # ROUGE expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract the mid F-measure scores
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    return {k: round(v, 4) for k, v in result.items()}
training_args = Seq2SeqTrainingArguments(
    output_dir=REPO,
    # overwrite_output_dir=True,
    evaluation_strategy='steps',
    # learning_rate=2e-5,
    eval_steps=5000,
    save_steps=5000,
    num_train_epochs=1,
    predict_with_generate=True,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    fp16=True,
    save_total_limit=2,
    # generation_max_length=256,
    # generation_num_beams=4,
    weight_decay=0.005,
    # logging_dir='logs',
    push_to_hub=True,
)
# Manually select the optimizer. The original T5 architecture uses the Adafactor optimizer
optimizer = Adafactor(
    model.parameters(),
    lr=1e-5,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False,
)
lr_scheduler = AdafactorSchedule(optimizer)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=test,
    optimizers=(optimizer, lr_scheduler),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.push_to_hub()
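After training, a quick sanity check can be run with the freshly fine-tuned model. This is a minimal sketch; the input sentence is a placeholder, not an example from the original card.
# Quick post-training check (the input sentence is an illustrative placeholder)
sample = "Spell correct: " + "пример длинного новостного текста для проверки модели"
enc = tokenizer(sample, max_length=MAX_INPUT, truncation=True, return_tensors="pt").to(model.device)
print(tokenizer.batch_decode(model.generate(**enc), skip_special_tokens=True))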
Example of Converting Arrays for the Network
input_data = ['After the start of Russia\'s special military operation to demilitarize Ukraine, the West imposed several rounds of new economic sanctions. The Kremlin called the new restrictions serious but noted that Russia had prepared for them in advance.']
output_data = ['The West imposed new sanctions against Russia']
# Tokenize the input data
task_prefix = "Spell correct: "
input_sequences = input_data
encoding = tokenizer(
    [task_prefix + sequence for sequence in input_sequences],
    padding="longest",
    max_length=MAX_INPUT,
    truncation=True,
    return_tensors="pt",
)
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask
# Tokenize the output data
target_encoding = tokenizer(output_data, padding="longest", max_length=MAX_OUTPUT, truncation=True)
labels = target_encoding.input_ids
# Replace padding token id's of the labels by -100
labels = torch.tensor(labels)
labels[labels == tokenizer.pad_token_id] = -100
# Convert our data to the dataset format (pandas is needed here)
import pandas as pd
data = Dataset.from_pandas(pd.DataFrame({'input_ids': list(np.array(input_ids)), 'attention_mask': list(np.array(attention_mask)), 'labels': list(np.array(labels))}))
data = data.train_test_split(0.02)
# And we'll get the input for our trainer: train_dataset = data['train'], eval_dataset = data['test']
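As a minimal sketch of that last step, the converted splits can be plugged into the trainer constructed earlier (reusing the same arguments and optimizer defined above):
# Minimal sketch: reuse the trainer setup from above with the converted splits
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=data['train'],
    eval_dataset=data['test'],
    optimizers=(optimizer, lr_scheduler),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()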