🚀 kazRush-ru-kk
kazRush-ru-kk is a model for translating from Russian to Kazakh. It was trained from randomly initialized weights, using the T5 configuration, on available open-source parallel data.
🚀 Quick Start
Using the model requires the sentencepiece library to be installed. After installing the necessary dependencies, the model can be run with the following code:
💻 Usage Examples
Basic Usage
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Use a GPU when one is available; otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-ru-kk').to(device)
tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-ru-kk')

@torch.inference_mode()
def generate(text, **kwargs):
    # Tokenize the Russian input and move the tensors to the model's device.
    inputs = tokenizer(text, return_tensors='pt').to(device)
    # Beam search with 5 beams; extra generation options can be passed via kwargs.
    hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)

print(generate("Как Кока-Кола может помочь автомобилисту?"))
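When translating many sentences, it may be more efficient to process them in batches rather than one call per sentence. The sketch below is a minimal, model-agnostic helper. The names `chunked`, `translate_in_batches`, and `translate_batch` are hypothetical and not part of the model's API; `translate_batch` stands for any function that maps a list of Russian strings to a list of Kazakh strings, for example a variant of `generate` that tokenizes with `padding=True` and decodes with `tokenizer.batch_decode`.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_in_batches(
    texts: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,
) -> List[str]:
    """Translate `texts` batch by batch, preserving the input order.

    `translate_batch` is an assumed callable (not part of the model's API)
    that takes a list of source strings and returns one translation per input.
    """
    translations: List[str] = []
    for batch in chunked(texts, batch_size):
        outputs = translate_batch(batch)
        if len(outputs) != len(batch):
            raise ValueError("translate_batch must return one output per input")
        translations.extend(outputs)
    return translations
```

The default batch size of 16 is arbitrary; a suitable value depends on sentence length and available memory.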
Advanced Usage
You can also access the model via the pipeline wrapper:
>>> from transformers import pipeline
>>> pipe = pipeline(model="deepvk/kazRush-ru-kk")
>>> pipe("Мама мыла раму")
[{'translation_text': 'Анам жақтауды сабындады'}]
📦 Data and Training
This model was trained on open-source Russian-Kazakh parallel data.
Preprocessing of the data included:
- Deduplication
- Removing junk characters, special tags, repeated whitespace, etc. from the texts
- Removing texts that were not in Russian or Kazakh (language detection was performed with facebook/fasttext-language-identification)
- Removing pairs with a low alignment score (the comparison was performed with sentence-transformers/LaBSE)
- Filtering the data using opusfilter tools
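The deduplication and text cleanup steps above can be sketched with the standard library alone. This is an illustrative approximation, not the authors' pipeline: the function names, the tag pattern, and the choice to drop pairs with an empty side are all assumptions. The language-identification, LaBSE alignment, and opusfilter steps are omitted because they depend on external models and tools.

```python
import re
import unicodedata
from typing import Iterable, List, Tuple

TAG_RE = re.compile(r"<[^>]+>")      # assumed pattern for "special tags"
WHITESPACE_RE = re.compile(r"\s+")   # any run of whitespace


def clean_text(text: str) -> str:
    """Strip tags and control characters, then collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    # Replace control/format characters (Unicode categories C*) with a space.
    text = "".join(
        " " if unicodedata.category(ch).startswith("C") else ch for ch in text
    )
    return WHITESPACE_RE.sub(" ", text).strip()


def clean_and_deduplicate(
    pairs: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Clean both sides of each (ru, kk) pair and keep the first occurrence."""
    seen = set()
    result = []
    for ru, kk in pairs:
        pair = (clean_text(ru), clean_text(kk))
        if not pair[0] or not pair[1]:
            continue  # assumption: drop pairs with an empty side
        if pair in seen:
            continue  # exact duplicate after cleaning
        seen.add(pair)
        result.append(pair)
    return result
```

Cleaning before deduplicating means that pairs differing only in whitespace or markup are treated as duplicates.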
The model was trained for 56 hours on two NVIDIA A100 80 GB GPUs.
📚 Evaluation
The current model was compared to another open-source translation model, NLLB. We compared our model to all versions of NLLB, excluding nllb-moe-54b due to its size.
The metrics (BLEU, chrF, and COMET) were calculated on the devtest part of the FLORES+ evaluation benchmark, the most recent evaluation benchmark for multilingual machine translation.
Calculation of BLEU and chrF follows the standard implementation from sacreBLEU, and COMET is calculated using the default model described in the COMET repository.
| Model | Size | BLEU | chrF | COMET |
|---|---|---|---|---|
| [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 13.8 | 48.2 | 86.8 |
| [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 14.8 | 50.1 | 88.1 |
| [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 15.2 | 50.2 | 88.4 |
| [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 15.6 | 50.7 | 88.9 |
| This model | 197M | 16.2 | 51.8 | 88.3 |
💻 More Usage Examples
>>> print(generate("Каждый охотник желает знать, где сидит фазан."))
Әрбір аңшы ғибадатхананың қайда отырғанын білгісі келеді.
>>> print(generate("Местным продуктом - специальитетом с защищённым географическим наименованием по происхождению считается люнебургский степной барашек."))
Шығу тегі бойынша қорғалған географиялық атауы бар жергілікті мамандандырылған өнім болып люнебургтік дала қошқар болып саналады.
>>> print(generate("Помогите мне удивить девушку"))
Қызды таң қалдыруға көмектесіңіз
📄 License
This project is licensed under the Apache-2.0 license.
📖 Citations
@misc{deepvk2024kazRushrukk,
    title={kazRush-ru-kk: translation model from Russian to Kazakh},
    author={Lebedeva, Anna and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/kazRush-ru-kk},
    publisher={Hugging Face},
    year={2024},
}