Russian Adaptation of Qwen2.5 and T-lite-it-1.0
This project is an adaptation of the Qwen2.5 and T-lite-it-1.0 models to the Russian language, aiming to improve both the generation speed and the quality of Russian text.
Quick Start
You can try out the model in the deployed Space (select the model in the parameters at the bottom):
https://huggingface.co/spaces/RefalMachine/RuadaptQwen2.5
Features
- GGUF Version: Currently in progress! The current version is v1.
- Russian Adaptation: Adapted the T-lite-it-1.0 model to the Russian language. Replaced the tokenizer, then performed continued pretraining on a Russian corpus, and applied the LEP (Learned Embedding Propagation) technique.
- Improved Generation Speed: Thanks to the new tokenizer (an extended tiktoken cl100k with a unigram tokenizer of 48k tokens), the generation speed* of Russian texts has increased by up to 60% compared to the original T-lite-it-1.0 model.
- *Generation speed refers to the number of Russian characters/words per second on the same text sequences.
Installation
No specific installation steps are provided in the original document.
Usage Examples
No code examples are provided in the original document; a minimal illustrative sketch is given below.
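The following is a minimal usage sketch with the transformers library, not taken from the original card. The repository id used below is an assumption; substitute the actual model id from the Hugging Face Hub.

```python
# Minimal usage sketch (assumed setup, not from the original card).
# The model id below is hypothetical; replace it with the published one.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RefalMachine/RuadaptQwen2.5-7B-Lite-Beta"  # hypothetical id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Расскажи кратко о Москве."}]  # "Tell me briefly about Moscow."
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```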
Documentation
Model Description
This is the GGUF version! Work in progress!!! The current version is v1.
It is an adaptation of the T-lite-it-1.0 model to the Russian language. The tokenizer was replaced, continued pretraining was then carried out on a Russian corpus, and the LEP (Learned Embedding Propagation) technique was applied.
Thanks to the new tokenizer (an extended tiktoken cl100k with a unigram tokenizer of 48k tokens), the generation speed* of Russian texts has increased by up to 60% compared to the original T-lite-it-1.0 model.
*Generation speed refers to the number of Russian characters/words per second on the same text sequences.
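A rough way to see where the speed-up comes from is to compare how many tokens each tokenizer needs for the same Russian text: fewer tokens per character means more characters produced per decoding step. Below is a small sketch of such a comparison; the Ruadapt repository id is an assumption, not taken from the card.

```python
# Sketch: compare token counts of the original and adapted tokenizers on the
# same Russian text. Fewer tokens per character -> more text per generated
# token, which is what the speed-up claim measures.
# The adapted repository id below is hypothetical.
from transformers import AutoTokenizer

text = "Обработка естественного языка позволяет компьютерам понимать человеческую речь."

original = AutoTokenizer.from_pretrained("t-tech/T-lite-it-1.0")
adapted = AutoTokenizer.from_pretrained("RefalMachine/RuadaptQwen2.5-7B-Lite-Beta")  # hypothetical

for name, tok in [("original", original), ("adapted", adapted)]:
    n_tokens = len(tok(text)["input_ids"])
    print(f"{name}: {n_tokens} tokens, {len(text) / n_tokens:.2f} characters per token")
```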
Tokenization


Metrics and Quality Assessment
The model was evaluated on Ru-Arena-General, Shlepa, MERA, and llmtf_open.
Results on Ru-Arena-General
Measurements were made using the official leaderboard code (https://github.com/VikhrModels/ru_llm_arena), but with repetition_penalty = 1.1.
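For reference, repetition_penalty is a standard generation parameter in transformers; a minimal sketch of how the same value could be set locally is shown below (the leaderboard itself uses its own harness, linked above).

```python
from transformers import GenerationConfig

# Sketch only: reproduce the repetition_penalty = 1.1 setting used in the
# evaluation when generating locally with transformers.
gen_config = GenerationConfig(repetition_penalty=1.1, max_new_tokens=512)
# output_ids = model.generate(input_ids, generation_config=gen_config)
```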

Results on Shlepa

Results on MERA

Results on llmtf_open
TODO
How to cite:
Tikhomirov M., Chernyshev D. Facilitating Large Language Model Russian Adaptation with Learned Embedding Propagation. Journal of Language and Education, 2024, Vol. 10, No. 4, pp. 130-145.
Tikhomirov M., Chernyshev D. Impact of Tokenization on LLaMa Russian Adaptation. 2023 Ivannikov Ispras Open Conference (ISPRAS), IEEE, 2023, pp. 163-168.
Warning
The model's responses do not reflect the opinions of the authors; they merely reproduce the knowledge obtained from the data at all stages of training (pretraining, tokenizer replacement, instruction tuning, answer quality calibration). The model was derived from a third-party pretrained model, and the current authors are not responsible for its pretraining. No additional steps were taken to change the "opinions" embedded in the LLM when creating this version of the model. Use with caution.
License
This project is licensed under the Apache-2.0 license.
| Property | Details |
|----------|---------|
| Datasets | IlyaGusev/saiga_scored, IlyaGusev/saiga_preferences, dichspace/darulm |
| Language | ru |
| Pipeline Tag | text-generation |
| License | apache-2.0 |
| Base Model | Qwen/Qwen2.5-7B, t-tech/T-lite-it-1.0 |