🚀 FRED-T5 large 820M (Full-scale Russian Enhanced Denoisers T5)
This model belongs to a family of pre-trained Transformer language models for Russian. It is based on the T5 architecture and trained on a large-scale Russian corpus, offering high-quality language processing capabilities for Russian text.
🚀 Quick Start
The design, pre-training, and evaluation of the model architecture are detailed in our preprint: A Family of Pretrained Transformer Language Models for Russian.
The model was trained by SberDevices.
✨ Features
- Architecture: Based on the T5 architecture, with 24 layers and a hidden size of 1024. More details can be found in config.json.
- Training Strategy: Trained on a mixture of 7 denoisers, similar to UL2 (https://arxiv.org/abs/2205.05131) but with several differences.
- Training Data: Trained on a 300GB Russian-language corpus, the same dataset used for the ruT5 models.
- Tokenizer: Uses a BBPE (byte-level BPE) tokenizer with a vocabulary of 50257 tokens plus 107 special tokens. The prefix tokens include <LM> and <SC1> through <SC6> (a minimal formatting sketch follows this list).
- Training Process: For the first half of the training time, the model was trained on a small part (1%, 3GB) of the full dataset, without task prefixes.
- RSG Training: For RussianSuperGLUE (RSG), training followed the procedure described in the T5 paper: multitask training was performed on all tasks first, then the best checkpoint for each task was selected and trained further. The RSG submission can be found here: https://russiansuperglue.com/login/submit_info/2060.
- Training Time: The total training time was approximately 35 days on 160 V100 GPUs and 5 days on 80 A100 GPUs.
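Since the denoiser prefixes listed above are ordinary special tokens in the vocabulary, a training-style example can be built by plain string concatenation. The sketch below is a minimal illustration, not the exact training pipeline: it reuses the checkpoint and prompt from the usage section, and the masked span and target format follow the standard T5 span-corruption convention; the exact per-denoiser corruption settings are described in the preprint.

```python
from transformers import GPT2Tokenizer

# Checkpoint name taken from the usage section below; the masking shown
# here is illustrative only, not the exact training configuration.
tokenizer = GPT2Tokenizer.from_pretrained('ai-forever/FRED-T5-large', eos_token='</s>')

# The denoiser prefixes are registered as special tokens in the vocabulary:
print(tokenizer.convert_tokens_to_ids(['<LM>', '<SC1>', '<SC6>']))

# An illustrative <SC1>-style span-corruption pair: the input hides a span
# behind <extra_id_0>, and the target restores it.
source = '<SC1>Принялся Кутузов рассказывать свою историю <extra_id_0>. Началось с того, что он был в армии.'
target = '<extra_id_0> как он сюда попал'

input_ids = tokenizer(source, return_tensors='pt').input_ids
labels = tokenizer(target, return_tensors='pt').input_ids
# During training, model(input_ids=input_ids, labels=labels) would yield the
# denoising loss for this pair.
```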
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import GPT2Tokenizer, T5ForConditionalGeneration

# The 820M model described in this card is published as 'ai-forever/FRED-T5-large'.
tokenizer = GPT2Tokenizer.from_pretrained('ai-forever/FRED-T5-large', eos_token='</s>')
model = T5ForConditionalGeneration.from_pretrained('ai-forever/FRED-T5-large')
device = 'cuda'
model.to(device)

# Plain language-model continuation with the <LM> prefix.
lm_text = '<LM>Принялся Кутузов рассказывать свою историю как он сюда попал. Началось'
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to(device)
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True)
print(tokenizer.decode(outputs[0][1:]))

# Span infilling with the <SC1> denoiser prefix: the model fills in <extra_id_0>.
lm_text = '<SC1>Принялся Кутузов рассказывать свою историю <extra_id_0>. Началось с того, что он был в армии, служил в артиллерии.'
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to(device)
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True)
print(tokenizer.decode(outputs[0][1:]))

# The same text with the <SC5> denoiser prefix and a longer generation budget.
lm_text = '<SC5>Принялся Кутузов рассказывать свою историю <extra_id_0>. Началось с того, что он был в армии, служил в артиллерии.'
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to(device)
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True, max_length=100)
print(tokenizer.decode(outputs[0][1:]))
```
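In these examples, the <LM> prefix requests plain left-to-right continuation of the prompt, while <SC1> and <SC5> invoke span-infilling denoisers: the model is expected to return the hidden span, prefixed by <extra_id_0>, rather than a continuation of the text. Because each <SCk> prefix corresponds to a denoiser trained with different corruption settings, different prefixes can yield spans of different lengths for the same input (the <SC5> example also raises max_length to allow a longer output).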
📚 Documentation
Authors
- NLP core team RnD (Telegram channel):
- Dmitry Zmitrovich
- Andrei Kalmykov
- Vitaly Kadulin
- Mikhail Novikov
- Alexey Khoroshilov
Salute AI Community.
Cite us
```
@misc{zmitrovich2023family,
      title={A Family of Pretrained Transformer Language Models for Russian},
      author={Dmitry Zmitrovich and Alexander Abramov and Andrey Kalmykov and Maria Tikhonova and Ekaterina Taktasheva and Danil Astafurov and Mark Baushenko and Artem Snegirev and Tatiana Shavrina and Sergey Markov and Vladislav Mikhailov and Alena Fenogenova},
      year={2023},
      eprint={2309.10931},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
📄 License
This model is licensed under the Apache 2.0 license.