🚀 FRED-T5 large 820M (Full-scale Russian Enhanced Denoisers T5)
This model belongs to a family of pre-trained Transformer language models for Russian. It is based on the T5 architecture and trained on a large-scale Russian corpus, offering high-quality language processing capabilities for Russian text.
🚀 Quick Start
The design, pre-training, and evaluation of the model architecture are detailed in our preprint: A Family of Pretrained Transformer Language Models for Russian.
The model was trained by SberDevices.
✨ Features
- Architecture: Based on the T5 architecture, with 24 layers and a hidden size of 1024. More details can be found in config.json.
- Training Strategy: Trained on a mixture of 7 denoisers, similar to UL2 (https://arxiv.org/abs/2205.05131) but with several differences.
- Training Data: Trained on a 300GB Russian-language corpus, the same dataset used for the ruT5 models.
- Tokenizer: Uses a BBPE (byte-level BPE) tokenizer with a vocabulary of 50257 tokens plus 107 special tokens. The prefix tokens include <LM> and <SC1> through <SC6> (a minimal formatting sketch follows this list).
- Training Process: For the first half of the training time, the model was trained on a small part (1%, 3GB) of the full dataset, without task prefixes.
- RSG Training: For RussianSuperGLUE (RSG), training followed the procedure described in the T5 paper: multitask training was performed on all tasks first, then the best checkpoint for each task was selected and trained further. The RSG submission can be found here: https://russiansuperglue.com/login/submit_info/2060.
- Training Time: The total training time was approximately 35 days on 160 V100 GPUs and 5 days on 80 A100 GPUs.
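Since the denoiser prefixes listed above are ordinary special tokens in the vocabulary, a training-style example can be built by plain string concatenation. The sketch below is a minimal illustration, not the exact training pipeline: it reuses the checkpoint and prompt from the usage section, and the masked span and target format follow the standard T5 span-corruption convention; the exact per-denoiser corruption settings are described in the preprint.

```python
from transformers import GPT2Tokenizer

# Checkpoint name taken from the usage section below; the masking shown
# here is illustrative only, not the exact training configuration.
tokenizer = GPT2Tokenizer.from_pretrained('ai-forever/FRED-T5-large', eos_token='</s>')

# The denoiser prefixes are registered as special tokens in the vocabulary:
print(tokenizer.convert_tokens_to_ids(['<LM>', '<SC1>', '<SC6>']))

# An illustrative <SC1>-style span-corruption pair: the input hides a span
# behind <extra_id_0>, and the target restores it.
source = '<SC1>Принялся Кутузов рассказывать свою историю <extra_id_0>. Началось с того, что он был в армии.'
target = '<extra_id_0> как он сюда попал'

input_ids = tokenizer(source, return_tensors='pt').input_ids
labels = tokenizer(target, return_tensors='pt').input_ids
# During training, model(input_ids=input_ids, labels=labels) would yield the
# denoising loss for this pair.
```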
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import GPT2Tokenizer, T5ForConditionalGeneration

# The 820M model described in this card is published as 'ai-forever/FRED-T5-large'.
tokenizer = GPT2Tokenizer.from_pretrained('ai-forever/FRED-T5-large', eos_token='</s>')
model = T5ForConditionalGeneration.from_pretrained('ai-forever/FRED-T5-large')
device = 'cuda'
model.to(device)

# Plain language-model continuation with the <LM> prefix.
lm_text = '<LM>Принялся Кутузов рассказывать свою историю как он сюда попал. Началось'
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to(device)
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True)
print(tokenizer.decode(outputs[0][1:]))

# Span infilling with the <SC1> denoiser prefix: the model fills in <extra_id_0>.
lm_text = '<SC1>Принялся Кутузов рассказывать свою историю <extra_id_0>. Началось с того, что он был в армии, служил в артиллерии.'
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to(device)
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True)
print(tokenizer.decode(outputs[0][1:]))

# The same text with the <SC5> denoiser prefix and a longer generation budget.
lm_text = '<SC5>Принялся Кутузов рассказывать свою историю <extra_id_0>. Началось с того, что он был в армии, служил в артиллерии.'
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to(device)
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True, max_length=100)
print(tokenizer.decode(outputs[0][1:]))
```
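In these examples, the <LM> prefix requests plain left-to-right continuation of the prompt, while <SC1> and <SC5> invoke span-infilling denoisers: the model is expected to return the hidden span, prefixed by <extra_id_0>, rather than a continuation of the text. Because each <SCk> prefix corresponds to a denoiser trained with different corruption settings, different prefixes can yield spans of different lengths for the same input (the <SC5> example also raises max_length to allow a longer output).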
📚 Documentation
Authors
- NLP core team RnD (Telegram channel):
- Dmitry Zmitrovich
- Andrei Kalmykov
- Vitaly Kadulin
- Mikhail Novikov
- Alexey Khoroshilov
Salute AI Community.
Cite us
```
@misc{zmitrovich2023family,
      title={A Family of Pretrained Transformer Language Models for Russian},
      author={Dmitry Zmitrovich and Alexander Abramov and Andrey Kalmykov and Maria Tikhonova and Ekaterina Taktasheva and Danil Astafurov and Mark Baushenko and Artem Snegirev and Tatiana Shavrina and Sergey Markov and Vladislav Mikhailov and Alena Fenogenova},
      year={2023},
      eprint={2309.10931},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
📄 License
This model is licensed under the Apache 2.0 license.