kazRush-kk-ru: Open-Source Model for Kazakh-to-Russian Translation

Developed by deepvk

kazRush-kk-ru is a Kazakh-to-Russian translation model based on the T5 configuration, trained on multiple parallel datasets.

Task: Machine Translation
Library: Transformers
License: Apache-2.0
Tags: Kazakh-Russian translation, T5 architecture optimization, Multi-source data training
Downloads: 2,630
Release date: 10/31/2024

Model Overview

This model translates Kazakh text into Russian. It uses the T5 architecture and was trained on open-source parallel corpora.

Model Features

High-quality translation: On the devtest part of FLORES+, the model scores 18.8 BLEU, 48.7 chrF, and 86.7 COMET, ahead of nllb-200-distilled-600M on all three metrics.
Multi-dataset training: Trained on four parallel datasets: OPUS Corpora, kazparc, the wmt19 dataset, and the TIL dataset.
Efficient inference: At a size of 197M, the model is small enough for efficient inference in practical applications.

Model Capabilities

Text translation from Kazakh to Russian

Use Cases

News translation: Translate Kazakh news content into Russian while preserving the meaning of the original.
Document translation: Translate Kazakh official documents or academic papers into Russian, with the aim of conveying technical terms and complex sentence structures accurately.

🚀 kazRush-kk-ru

kazRush-kk-ru is a model for translating from Kazakh to Russian. It was trained from randomly initialized weights, using the T5 configuration, on openly available parallel data.

🚀 Quick Start

Using the model requires the sentencepiece library to be installed. After installing the necessary dependencies, the model can be run with the following code:

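from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Use the GPU when one is available; otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-kk-ru').to(device)
tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-kk-ru')

@torch.inference_mode()  # disable gradient tracking during generation
def generate(text, **kwargs):
    # Tokenize the Kazakh input and move the tensors to the model's device.
    inputs = tokenizer(text, return_tensors='pt').to(device)
    # Beam search with 5 beams; extra kwargs are passed through to model.generate().
    hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)

print(generate("Анам жақтауды жуды."))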

You can also access the model via the pipeline wrapper:

>>> from transformers import pipeline

>>> pipe = pipeline(model="deepvk/kazRush-kk-ru")
>>> pipe("Иттерді кім шығарды?")
[{'translation_text': 'Кто выпустил собак?'}]
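
To translate several sentences at once, the generate helper above can be extended to work on batches. The following is a minimal sketch and not part of the original model card: the names chunks and translate_many and the default batch_size of 8 are illustrative assumptions, and the sketch presumes that the tokenizer defines a padding token so that padding=True works.

```python
from typing import Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_many(texts: Sequence[str], model, tokenizer,
                   device: str = "cpu", batch_size: int = 8, **kwargs) -> List[str]:
    """Translate `texts` batch by batch (hypothetical helper, not from the model card)."""
    import torch  # imported lazily so that `chunks` works without torch installed

    results: List[str] = []
    for batch in chunks(texts, batch_size):
        # Pad to the longest sentence in the batch; assumes a pad token is defined.
        inputs = tokenizer(list(batch), return_tensors="pt",
                           padding=True, truncation=True).to(device)
        with torch.inference_mode():
            hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
        # batch_decode returns one string per generated sequence.
        results.extend(tokenizer.batch_decode(hypotheses, skip_special_tokens=True))
    return results
```

With the model, tokenizer, and device from the snippet above, translate_many(["Анам жақтауды жуды.", "Иттерді кім шығарды?"], model, tokenizer, device=device) would return a list with one Russian string per input sentence.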

✨ Features

Translation Capability: Translates text from Kazakh to Russian.
Based on T5: Trained from randomly initialized weights using the T5 configuration.

📦 Installation

The model requires the sentencepiece library in addition to transformers and torch. It can be installed with pip install sentencepiece.

💻 Usage Examples

More Usage Examples

>>> print(generate("Балық көбінесе сулардағы токсиндердің жоғары концентрацияларына байланысты өледі."))
Рыба часто умирает из-за высоких концентраций токсинов в воде.

>>> print(generate("Өткен 3 айда 80-нен астам қамалушы ресми түрде айып тағылмастан изолятордан шығарылды."))
За прошедшие 3 месяца более 80 арестованных были официально извлечены из изолятора без обвинения.

>>> print(generate("Бұл тастардың он бесі өткен шілде айындағы метеориттік жаңбырға жатқызылады."))
Пятнадцать этих камней относят к метеоритным дождям прошлого июля.

📚 Documentation

Data and Training

This model was trained on the following Russian-Kazakh parallel data:

Dataset	Size
OPUS Corpora	718K pairs
kazparc	2,150K pairs
wmt19 dataset	5,063K pairs
TIL dataset	4,403K pairs
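
For a sense of scale, the pair counts in the table above can be summed and expressed as shares of the total. This is only an arithmetic sketch over the figures listed in the table; the source does not say whether these counts were taken before or after the filtering steps, so the total should be read as approximate. The function name dataset_shares is an illustrative choice.

```python
# Pair counts in thousands, copied from the table above.
PAIRS_K = {
    "OPUS Corpora": 718,
    "kazparc": 2150,
    "wmt19 dataset": 5063,
    "TIL dataset": 4403,
}


def dataset_shares(pairs_k):
    """Return the total pair count (in thousands) and each dataset's share in percent."""
    total = sum(pairs_k.values())
    shares = {name: 100 * count / total for name, count in pairs_k.items()}
    return total, shares


total_k, shares = dataset_shares(PAIRS_K)
print(f"Total: {total_k:,}K pairs")
# List the datasets from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {share:.1f}%")
```

The listed sizes add up to 12,334K pairs, of which the wmt19 and TIL datasets together make up roughly three quarters.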

Preprocessing of the data included:

Deduplication
Removing junk characters, special tags, repeated whitespace, etc. from the texts
Removing texts that were not in Russian or Kazakh (language detection was performed with facebook/fasttext-language-identification)
Removing pairs with a low alignment score (scores were computed with sentence-transformers/LaBSE)
Filtering the data using opusfilter tools
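
The first steps of the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: the regular expressions, the function names clean_text and preprocess_pairs, and the min_score threshold are all hypothetical. Language identification and opusfilter are not reproduced here, and the alignment check is represented by an optional score_fn callback that stands in for a LaBSE-based similarity score.

```python
import re
from typing import Callable, Iterable, List, Optional, Tuple

TAG_RE = re.compile(r"<[^>]+>")   # crude matcher for HTML/XML-like tags (assumption)
SPACE_RE = re.compile(r"\s+")     # any run of whitespace

Pair = Tuple[str, str]


def clean_text(text: str) -> str:
    """Replace tags with a space, then collapse repeated whitespace."""
    text = TAG_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def preprocess_pairs(pairs: Iterable[Pair],
                     score_fn: Optional[Callable[[str, str], float]] = None,
                     min_score: float = 0.0) -> List[Pair]:
    """Clean, deduplicate, and optionally filter (Kazakh, Russian) pairs by a score."""
    seen = set()
    kept: List[Pair] = []
    for kk, ru in pairs:
        kk, ru = clean_text(kk), clean_text(ru)
        if not kk or not ru:
            continue  # drop pairs with an empty side
        if (kk, ru) in seen:
            continue  # drop exact duplicates (after cleaning)
        seen.add((kk, ru))
        # Placeholder for the alignment filter; score_fn is supplied by the caller.
        if score_fn is not None and score_fn(kk, ru) < min_score:
            continue
        kept.append((kk, ru))
    return kept
```

Deduplicating after cleaning, rather than before, lets pairs that differ only in markup or spacing collapse into a single entry.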

The model was trained for 56 hours on two NVIDIA A100 80 GB GPUs.

Evaluation

The model was compared with NLLB, another open-source translation model. All NLLB versions were evaluated except nllb-moe-54b, which was excluded because of its size. The metrics (BLEU, chrF, and COMET) were calculated on the devtest part of the FLORES+ evaluation benchmark, the most recent evaluation benchmark for multilingual machine translation.
BLEU and chrF follow the standard sacreBLEU implementation, and COMET is calculated using the default model described in the COMET repository.

Model	Size	BLEU	chrF	COMET
nllb-200-distilled-600M	600M	18.0	47.3	85.6
This model	197M	18.8	48.7	86.7
nllb-200-1.3B	1.3B	20.4	49.3	87.9
nllb-200-distilled-1.3B	1.3B	20.8	49.6	88.1
nllb-200-3.3B	3.3B	21.5	50.7	88.7
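
The table can also be read in terms of size versus quality. The sketch below only restates the table's numbers; it labels the "This model" row as kazRush-kk-ru, and the Result type and the larger_but_weaker function are illustrative names, not part of any library.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    size_m: int    # model size in millions, from the Size column
    bleu: float
    chrf: float
    comet: float


# Rows copied from the evaluation table above.
RESULTS = [
    Result("nllb-200-distilled-600M", 600, 18.0, 47.3, 85.6),
    Result("kazRush-kk-ru", 197, 18.8, 48.7, 86.7),
    Result("nllb-200-1.3B", 1300, 20.4, 49.3, 87.9),
    Result("nllb-200-distilled-1.3B", 1300, 20.8, 49.6, 88.1),
    Result("nllb-200-3.3B", 3300, 21.5, 50.7, 88.7),
]


def larger_but_weaker(target: Result, results: List[Result]) -> List[str]:
    """Names of models that are larger than `target` yet score lower on every metric."""
    return [
        r.name
        for r in results
        if r.size_m > target.size_m
        and r.bleu < target.bleu
        and r.chrf < target.chrf
        and r.comet < target.comet
    ]


ours = next(r for r in RESULTS if r.name == "kazRush-kk-ru")
print(larger_but_weaker(ours, RESULTS))
for r in RESULTS:
    # Size relative to this model and BLEU difference against it.
    print(f"{r.name}: {r.size_m / ours.size_m:.1f}x the size, BLEU {r.bleu - ours.bleu:+.1f}")
```

The first print shows ['nllb-200-distilled-600M']: that model is about three times larger yet scores lower on all three metrics, while the 1.3B and 3.3B NLLB variants remain ahead.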

📄 License

This project is licensed under the Apache-2.0 license.

📚 Citations

@misc{deepvk2024kazRushkkru,
    title={kazRush-kk-ru: translation model from Kazakh to Russian},
    author={Lebedeva, Anna and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/kazRush-kk-ru},
    publisher={Hugging Face},
    year={2024},
}
