🚀 kazRush-kk-ru
kazRush-kk-ru is a translation model for translating from Kazakh to Russian. It was trained from randomly initialized weights, using the T5 configuration, on available open-source parallel data.
🚀 Quick Start
Using the model requires the sentencepiece library to be installed. After installing the necessary dependencies, the model can be run with the following code:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Use the GPU when one is available; otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-kk-ru').to(device)
tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-kk-ru')

@torch.inference_mode()
def generate(text, **kwargs):
    # Tokenize the input, run beam search, and decode the best hypothesis.
    inputs = tokenizer(text, return_tensors='pt').to(device)
    hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)

print(generate("Анам жақтауды жуды."))
You can also access the model via the pipeline wrapper:
>>> from transformers import pipeline
>>> pipe = pipeline(model="deepvk/kazRush-kk-ru")
>>> pipe("Иттерді кім шығарды?")
[{'translation_text': 'Кто выпустил собак?'}]
✨ Features
- Kazakh-to-Russian translation: Translates text from Kazakh to Russian.
- Based on T5: Trained from randomly initialized weights using the T5 configuration.
📦 Installation
The sentencepiece library is required to use this model.
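Before loading the model, it can help to confirm that the dependencies are present. The snippet below is a minimal sketch, not part of the original model card: `missing_packages` is a hypothetical helper, and the package list assumes that `transformers` and `torch` are needed in addition to `sentencepiece`, because the Quick Start code imports them.

```python
import importlib.util

# `sentencepiece` is named by this README; `transformers` and `torch` are
# assumed from the import statements in the Quick Start code.
REQUIRED = ("sentencepiece", "transformers", "torch")


def missing_packages(names=REQUIRED):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

Any package reported as missing can usually be installed with pip, for example `pip install sentencepiece`.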
💻 Usage Examples
More Usage Examples
>>> print(generate("Балық көбінесе сулардағы токсиндердің жоғары концентрацияларына байланысты өледі."))
Рыба часто умирает из-за высоких концентраций токсинов в воде.
>>> print(generate("Өткен 3 айда 80-нен астам қамалушы ресми түрде айып тағылмастан изолятордан шығарылды."))
За прошедшие 3 месяца более 80 арестованных были официально извлечены из изолятора без обвинения.
>>> print(generate("Бұл тастардың он бесі өткен шілде айындағы метеориттік жаңбырға жатқызылады."))
Пятнадцать этих камней относят к метеоритным дождям прошлого июля.
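The calls above translate one sentence at a time. When many sentences need translating, passing them to the model in batches is usually faster. The following is a sketch under stated assumptions rather than the authors' code: `batched` and `translate_batch` are hypothetical helpers, the batch size of 8 is arbitrary, and the tokenizer is assumed to define a padding token (as T5 tokenizers normally do). `model`, `tokenizer`, and `device` are the objects created in the Quick Start code.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_batch(texts, model, tokenizer, device, batch_size=8, **kwargs):
    """Translate a list of Kazakh sentences, one padded batch at a time."""
    # Imported here so that `batched` stays usable without torch installed.
    import torch

    translations = []
    with torch.inference_mode():
        for batch in batched(texts, batch_size):
            # padding=True pads every sentence to the longest one in the batch.
            inputs = tokenizer(batch, return_tensors="pt", padding=True).to(device)
            hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
            translations.extend(
                tokenizer.batch_decode(hypotheses, skip_special_tokens=True)
            )
    return translations
```

A call such as `translate_batch(sentences, model, tokenizer, device)` returns the translations in the same order as the input list.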
📚 Documentation
Data and Training
This model was trained on open-source parallel data consisting of Russian-Kazakh language pairs.
Preprocessing of the data included:
- Deduplication
- Removing junk characters, special tags, repeated whitespace, etc. from the texts
- Removing texts that were not in Russian or Kazakh (language detection was performed with facebook/fasttext-language-identification)
- Removing pairs with a low alignment score (the comparison was performed with sentence-transformers/LaBSE)
- Filtering the data using opusfilter tools
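Some of the steps above can be illustrated with a short, self-contained sketch. This is not the authors' pipeline: the regular expressions are one guess at what "junk characters" and "special tags" mean, `score_fn` is a hypothetical stand-in for an alignment scorer such as LaBSE cosine similarity, and `min_score` is an arbitrary threshold. Language identification and the opusfilter stage are not shown.

```python
import re

# Assumed definitions; the README does not say exactly what was removed.
TAG_RE = re.compile(r"<[^>]+>")  # HTML/XML-like tags
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\ufffd]")  # control chars, U+FFFD
SPACE_RE = re.compile(r"\s+")  # any run of whitespace


def clean_text(text):
    """Strip tags and control characters, then collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = CONTROL_RE.sub("", text)
    return SPACE_RE.sub(" ", text).strip()


def preprocess(pairs, score_fn=None, min_score=0.0):
    """Clean, deduplicate, and optionally score-filter (kk, ru) pairs."""
    seen = set()
    result = []
    for kk, ru in pairs:
        kk, ru = clean_text(kk), clean_text(ru)
        if not kk or not ru:
            continue  # drop pairs with an empty side
        if (kk, ru) in seen:
            continue  # drop exact duplicates after cleaning
        seen.add((kk, ru))
        if score_fn is not None and score_fn(kk, ru) < min_score:
            continue  # drop pairs the scorer considers poorly aligned
        result.append((kk, ru))
    return result
```

Cleaning runs before deduplication so that pairs differing only in markup or spacing are treated as the same pair.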
The model was trained for 56 hours on two NVIDIA A100 80 GB GPUs.
Evaluation
The current model was compared to another open-source translation model, NLLB. We compared our model to all versions of NLLB except nllb-moe-54b, which was excluded due to its size.
The metrics (BLEU, chrF, and COMET) were calculated on the devtest part of the FLORES+ evaluation benchmark, the most recent evaluation benchmark for multilingual machine translation.
BLEU and chrF were calculated with the standard implementation from sacreBLEU, and COMET was calculated using the default model described in the COMET repository.
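As an illustration of the BLEU and chrF part, the sketch below scores a list of translations against references using sacreBLEU's default settings. It is an assumption-laden example, not the authors' evaluation script: `score_translations` is a hypothetical helper, one reference per hypothesis is assumed, and COMET is not covered.

```python
def score_translations(hypotheses, references):
    """Return corpus-level BLEU and chrF, one reference per hypothesis."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")

    # Imported here so the length check works even without sacrebleu installed.
    from sacrebleu.metrics import BLEU, CHRF

    # sacreBLEU expects a list of reference streams, hence the extra brackets.
    bleu = BLEU().corpus_score(hypotheses, [references])
    chrf = CHRF().corpus_score(hypotheses, [references])
    return {"bleu": bleu.score, "chrf": chrf.score}
```

Here `hypotheses` would hold the model's Russian outputs for the FLORES+ devtest sentences and `references` the corresponding Russian reference translations.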
📄 License
This project is licensed under the Apache-2.0 license.
📚 Citations
@misc{deepvk2024kazRushkkru,
    title={kazRush-kk-ru: translation model from Kazakh to Russian},
    author={Lebedeva, Anna and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/kazRush-kk-ru},
    publisher={Hugging Face},
    year={2024},
}