kazRush-ru-kk Open Source Translation Model - Free and Accurate Translation from Russian to Kazakh


Developed by deepvk

kazRush-ru-kk is a Russian-to-Kazakh translation model based on the T5 configuration, trained on multiple open-source parallel datasets.

Machine Translation

Transformers

Open Source License: Apache-2.0

#Russian-Kazakh Translation #T5 Architecture #Multi-dataset Training

Downloads 332

Release Time: 11/7/2024

Model Overview

This model is designed specifically for translating Russian text into Kazakh. Built on the T5 architecture and trained on large-scale parallel data, it scores higher on BLEU and chrF than the NLLB models it was compared against, despite having only 197M parameters.

Model Features

High-performance Translation

Scores higher on BLEU and chrF than every NLLB version it was compared against (600M to 3.3B parameters).

Multi-source Data Training

Integrates high-quality parallel datasets including OPUS Corpora, kazparc, wmt19, and TIL.

Strict Data Filtering

Ensures training data quality through various techniques, including deduplication, language detection, and sentence alignment scoring.

Model Capabilities

Russian-to-Kazakh Translation

Text Generation

Use Cases

Language Translation

Daily Conversation Translation

Translating everyday Russian conversations into Kazakh

Example: 'Помогите мне удивить девушку' → 'Қызды таң қалдыруға көмектесіңіз'

Technical Term Translation

Handling translations of texts containing technical terms

Example: Accurate translation of geographically protected product names

🚀 kazRush-ru-kk

kazRush-ru-kk is a translation model designed for translating from Russian to Kazakh. It was trained from randomly initialized weights, using the T5 configuration, on available open-source parallel data.

🚀 Quick Start

Using the model requires the sentencepiece library to be installed. After installing the necessary dependencies, the model can be run with the following code:

💻 Usage Examples

Basic Usage

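from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Use the GPU when one is available; otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-ru-kk').to(device)
tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-ru-kk')

@torch.inference_mode()  # disable gradient tracking during generation
def generate(text, **kwargs):
    # Tokenize the Russian input and move the tensors to the model's device.
    inputs = tokenizer(text, return_tensors='pt').to(device)
    # Beam search with 5 beams; extra generation options pass through kwargs.
    hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
    # Decode the top hypothesis into Kazakh text.
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)

print(generate("Как Кока-Кола может помочь автомобилисту?"))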

Advanced Usage

You can also access the model via the pipeline wrapper:

>>> from transformers import pipeline

>>> pipe = pipeline(model="deepvk/kazRush-ru-kk")
>>> pipe("Мама мыла раму")
[{'translation_text': 'Анам жақтауды сабындады'}]
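
For several sentences at once, the pipeline can also be called with a list of strings. The sketch below is a minimal illustration, not part of the original model card: the helper names `extract_translations` and `translate_batch` and the batch size of 8 are assumptions, and the `transformers` import is deferred so that the model is only downloaded when `translate_batch` is actually called.

```python
def extract_translations(outputs):
    """Flatten translation-pipeline output into a plain list of strings."""
    texts = []
    for item in outputs:
        # An item is a dict, or a list of dicts when several candidates are returned.
        candidates = item if isinstance(item, list) else [item]
        texts.extend(c["translation_text"] for c in candidates)
    return texts


def translate_batch(sentences, batch_size=8):
    """Translate a list of Russian sentences with the pipeline wrapper."""
    # Imported lazily: loading the pipeline downloads the model on first use.
    from transformers import pipeline

    pipe = pipeline(model="deepvk/kazRush-ru-kk")
    return extract_translations(pipe(sentences, batch_size=batch_size))
```

Calling `translate_batch(["Мама мыла раму", "Помогите мне удивить девушку"])` should return one Kazakh string per input sentence, in the same order.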

📦 Data and Training

This model was trained on the following Russian-Kazakh parallel data:

Dataset	Number of pairs
OPUS Corpora	718K
kazparc	2,150K
wmt19	5,063K
TIL	4,403K
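
As a quick sanity check on the corpus composition, the listed sizes can be summed and converted to shares. This is only arithmetic over the numbers in the table above; the card does not say whether these counts are taken before or after filtering, so the total should be read as the sum of the listed figures and nothing more.

```python
# Pair counts in thousands, copied from the table above.
pairs_k = {"OPUS Corpora": 718, "kazparc": 2150, "wmt19": 5063, "TIL": 4403}

total_k = sum(pairs_k.values())
# Percentage of the listed pairs contributed by each dataset.
shares = {name: 100 * count / total_k for name, count in pairs_k.items()}

for name, count in pairs_k.items():
    print(f"{name}: {count}K pairs ({shares[name]:.1f}%)")
print(f"Total: {total_k}K pairs")  # Total: 12334K pairs
```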

Preprocessing of the data included:

Deduplication
Removing junk characters, special tags, repeated whitespace, etc. from the texts
Removing texts that were not in Russian or Kazakh (language detection was performed with facebook/fasttext-language-identification)
Removing pairs with a low alignment score (computed with sentence-transformers/LaBSE)
Filtering the data using opusfilter tools
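
The first two steps of the list above (cleanup and deduplication) can be sketched in plain Python. This is an illustrative approximation, not the authors' pipeline: the regular expressions, the choice of which characters count as junk, and the function names `clean_text` and `preprocess` are all assumptions. Language identification, LaBSE alignment scoring, and opusfilter are left out because they depend on external models and tools.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # crude matcher for HTML/XML-style tags (assumption)
SPACE_RE = re.compile(r"\s+")    # any run of whitespace


def clean_text(text: str) -> str:
    """Unescape entities, drop tags and control characters, collapse whitespace."""
    text = html.unescape(text)
    text = TAG_RE.sub(" ", text)
    # Drop control/format characters (Unicode category C*), keeping tabs and newlines
    # so that they are collapsed into single spaces below.
    text = "".join(
        ch for ch in text
        if ch in "\t\n" or not unicodedata.category(ch).startswith("C")
    )
    return SPACE_RE.sub(" ", text).strip()


def preprocess(pairs):
    """Clean each (ru, kk) pair and keep only the first copy of exact duplicates."""
    seen = set()
    result = []
    for ru, kk in pairs:
        pair = (clean_text(ru), clean_text(kk))
        if not pair[0] or not pair[1] or pair in seen:
            continue  # skip empty sides and repeated pairs
        seen.add(pair)
        result.append(pair)
    return result
```

Cleaning before deduplicating means that pairs differing only in markup or spacing collapse into a single entry.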

The model was trained for 56 hours on two NVIDIA A100 80 GB GPUs.

📚 Evaluation

The current model was compared to another open-source translation model, NLLB. We compared our model to all versions of NLLB except nllb-moe-54b, which was excluded due to its size. The metrics (BLEU, chrF, and COMET) were calculated on the devtest split of the FLORES+ evaluation benchmark, the most recent evaluation benchmark for multilingual machine translation. BLEU and chrF follow the standard sacreBLEU implementation, and COMET is calculated using the default model described in the COMET repository.
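
To make the chrF metric more concrete, here is a simplified character n-gram F-score in plain Python. It is not sacreBLEU's implementation: the function name `simple_chrf`, the whitespace handling, and the sentence-level averaging are assumptions made for illustration, so its output will not reproduce the reported scores. The defaults (n-grams up to length 6, beta of 2) follow the commonly used chrF settings.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average the per-order F-beta over character n-grams, scaled to 0..100."""
    scores = []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # this n-gram order is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        # F-beta with beta > 1 weights recall more heavily than precision.
        scores.append((1 + beta**2) * precision * recall / (beta**2 * precision + recall))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

A hypothesis identical to its reference scores 100, and one that shares no characters with it scores 0; real translations fall in between.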

Model	Size	BLEU	chrF	COMET
[nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)	600M	13.8	48.2	86.8
[nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)	1.3B	14.8	50.1	88.1
[nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)	1.3B	15.2	50.2	88.4
[nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)	3.3B	15.6	50.7	88.9
This model	197M	16.2	51.8	88.3
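
The table can be read programmatically to see which model leads on each metric. The snippet below only restates the reported numbers; the dictionary layout and the label `kazRush-ru-kk` for the "This model" row are choices made for this sketch.

```python
# (size, BLEU, chrF, COMET) copied from the table above.
results = {
    "nllb-200-distilled-600M": ("600M", 13.8, 48.2, 86.8),
    "nllb-200-1.3B": ("1.3B", 14.8, 50.1, 88.1),
    "nllb-200-distilled-1.3B": ("1.3B", 15.2, 50.2, 88.4),
    "nllb-200-3.3B": ("3.3B", 15.6, 50.7, 88.9),
    "kazRush-ru-kk": ("197M", 16.2, 51.8, 88.3),
}

metrics = ("BLEU", "chrF", "COMET")
best = {}
for column, metric in enumerate(metrics, start=1):
    # Pick the model with the highest value in this metric's column.
    best[metric] = max(results, key=lambda name: results[name][column])
    print(f"{metric}: best is {best[metric]} ({results[best[metric]][column]})")
```

On these figures the 197M-parameter model leads on BLEU and chrF, while nllb-200-3.3B has the highest COMET score.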

💻 More Usage Examples

>>> print(generate("Каждый охотник желает знать, где сидит фазан."))
Әрбір аңшы ғибадатхананың қайда отырғанын білгісі келеді.

>>> print(generate("Местным продуктом - специальитетом с защищённым географическим наименованием по происхождению считается люнебургский степной барашек."))
Шығу тегі бойынша қорғалған географиялық атауы бар жергілікті мамандандырылған өнім болып люнебургтік дала қошқар болып саналады.

>>> print(generate("Помогите мне удивить девушку"))
Қызды таң қалдыруға көмектесіңіз

📄 License

This project is licensed under the Apache-2.0 license.

📖 Citations

@misc{deepvk2024kazRushrukk,
    title={kazRush-ru-kk: translation model from Russian to Kazakh},
    author={Lebedeva, Anna and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/kazRush-ru-kk},
    publisher={Hugging Face},
    year={2024},
}
