🚀 FRED-T5 1.7B (Full-scale Russian Enhanced Denoisers T5)
FRED-T5 1.7B is a powerful language model for Russian based on the T5 architecture, offering high-performance language processing trained on a large-scale Russian corpus.
The model architecture design, pretraining, and evaluation are documented in our preprint: A Family of Pretrained Transformer Language Models for Russian.
The model was trained by SberDevices.
📋 Model Information
| Property | Details |
|---|---|
| Model Type | Based on the T5 architecture |
| Layers | 24 |
| Hidden Size | 1536 |
| Training Data | 300GB Russian-language corpus, the same as used for the ruT5 models |
| Tokenizer | BBPE tokenizer, 50,257 tokens + 107 special tokens. Prefix tokens: `<LM>`, `<SC1>`, …, `<SC6>` |
| Training Process | For the first half of training, the model was trained on 1% (3GB) of the dataset without task prefixes. For RSG, training followed the T5 paper: first multitask training over all tasks, then the best checkpoint for each task was fine-tuned further. RSG submission: https://russiansuperglue.com/login/submit_info/1936 |
| Total Training Time | About 45 days on 112 A100 GPUs |
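The prefix tokens in the table select which pretraining task the model runs at inference time: `<LM>` for left-to-right continuation, `<SC1>` through `<SC6>` for the span-corruption denoisers. A minimal sketch of assembling a prompt (the `with_prefix` helper and its validation are illustrative, not part of the model's API):

```python
# Prefix tokens listed in the table above; each selects a pretraining task.
PREFIXES = ['<LM>'] + [f'<SC{i}>' for i in range(1, 7)]

def with_prefix(prefix: str, text: str) -> str:
    """Prepend a task prefix to raw text (illustrative helper, not part of the model API)."""
    if prefix not in PREFIXES:
        raise ValueError(f'unknown prefix {prefix!r}; expected one of {PREFIXES}')
    return prefix + text

prompt = with_prefix('<LM>', 'Принялся Кутузов рассказывать свою историю.')
print(prompt)  # <LM>Принялся Кутузов рассказывать свою историю.
```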
🚀 Quick Start
The model is based on the T5 architecture; its exact configuration can be found in the config.json file.
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import GPT2Tokenizer, T5ForConditionalGeneration

# FRED-T5 uses a GPT2-style BBPE tokenizer with an explicit </s> EOS token
tokenizer = GPT2Tokenizer.from_pretrained('ai-forever/FRED-T5-1.7B', eos_token='</s>')
model = T5ForConditionalGeneration.from_pretrained('ai-forever/FRED-T5-1.7B')
device = 'cuda'
model.to(device)

# Language modeling: the <LM> prefix asks the model to continue the text
lm_text = '<LM>Принялся Кутузов рассказывать свою историю как он сюда попал. Началось'
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to(device)
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True)
print(tokenizer.decode(outputs[0][1:]))

# Span corruption: <SC1> plus an <extra_id_0> sentinel asks the model to fill in the masked span
lm_text = '<SC1>Принялся Кутузов рассказывать свою историю <extra_id_0>. Началось с того, что он был в армии, служил в артиллерии.'
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to(device)
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True)
print(tokenizer.decode(outputs[0][1:]))

# <SC5> selects a different span-corruption denoiser
lm_text = '<SC5>Принялся Кутузов рассказывать свою историю <extra_id_0>. Началось с того, что он был в армии, служил в артиллерии.'
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to(device)
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True)
print(tokenizer.decode(outputs[0][1:]))
```
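The `<SC…>` denoisers return their answer in T5's sentinel format: the output repeats `<extra_id_0>` followed by the predicted span. A hedged sketch of splicing such an output back into the corrupted input (the `fill_span` helper is illustrative; real model outputs may contain an EOS token and extra whitespace):

```python
def fill_span(corrupted: str, model_output: str, sentinel: str = '<extra_id_0>') -> str:
    """Replace the sentinel in the corrupted input with the span predicted by the model.

    Illustrative helper: assumes a single sentinel and strips a trailing </s> if present.
    """
    span = model_output
    if span.startswith(sentinel):
        span = span[len(sentinel):]
    span = span.replace('</s>', '').strip()
    return corrupted.replace(sentinel, span)

corrupted = 'Принялся Кутузов рассказывать свою историю <extra_id_0>. Началось с того, что он был в армии.'
output = '<extra_id_0> о том, как он сюда попал</s>'
print(fill_span(corrupted, output))
```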
📚 Documentation
The model was trained on a mixture of 7 denoisers, similar to UL2 (https://arxiv.org/abs/2205.05131) but with several differences.
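Each denoiser corrupts text differently (span length and corruption rate vary across the mixture, analogous to UL2). A simplified sketch of how one span-corruption training pair could be built with T5-style sentinels (the helper and its parameters are illustrative, not the actual pretraining settings):

```python
def make_span_corruption_pair(tokens, start, length, prefix='<SC1>'):
    """Mask tokens[start:start+length] with a sentinel, returning (input, target).

    Simplified single-span version; real pretraining masks multiple spans
    with <extra_id_0>, <extra_id_1>, ... sentinels.
    """
    sentinel = '<extra_id_0>'
    inp = tokens[:start] + [sentinel] + tokens[start + length:]
    tgt = [sentinel] + tokens[start:start + length]
    return prefix + ' '.join(inp), ' '.join(tgt)

src = 'Принялся Кутузов рассказывать свою историю'.split()
inp, tgt = make_span_corruption_pair(src, start=2, length=2)
print(inp)  # <SC1>Принялся Кутузов <extra_id_0> историю
print(tgt)  # <extra_id_0> рассказывать свою
```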
👨‍💻 Authors
- NLP Core Team RnD Telegram channel:
- Dmitry Zmitrovich
- Andrei Kalmykov
- Vitaly Kadulin
- Mikhail Novikov
- Alexey Khoroshilov
Salute AI Community.
📄 License
This project is licensed under the Apache-2.0 license.
📖 Cite us
@misc{zmitrovich2023family,
title={A Family of Pretrained Transformer Language Models for Russian},
author={Dmitry Zmitrovich and Alexander Abramov and Andrey Kalmykov and Maria Tikhonova and Ekaterina Taktasheva and Danil Astafurov and Mark Baushenko and Artem Snegirev and Tatiana Shavrina and Sergey Markov and Vladislav Mikhailov and Alena Fenogenova},
year={2023},
eprint={2309.10931},
archivePrefix={arXiv},
primaryClass={cs.CL}
}