English-Turkic Languages Translation Model
This project covers translation from English into Turkic languages. It provides a Transformer-based model, documents its pre-processing steps, and reports evaluation scores on several test sets.
Quick Start
Features
- Language Coverage: Supports 24 Turkic target language and script variants, including aze_Latn, bak, and chv (full list under Model Details).
- Model Architecture: Uses a Transformer model for translation.
- Pre-processing: Applies normalization and SentencePiece (spm32k, spm32k).
Usage Examples
The original README provides no code examples.
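As a starting point, here is a minimal sketch of how a model like this is typically called through the Hugging Face `transformers` library. Several things in it are assumptions rather than facts from this README: the hub identifier `Helsinki-NLP/opus-mt-en-trk`, the helper names `tag_sentences` and `translate`, and the convention that OPUS-MT models with multiple target languages expect a sentence-initial token of the form `>>id<<` selecting the target language. Check the OPUS readme linked under Model Details before relying on any of them.

```python
# Target language IDs copied from the "Target Language(s)" list in Model Details.
TARGET_LANGS = {
    "aze_Latn", "bak", "chv", "crh", "crh_Latn", "kaz_Cyrl", "kaz_Latn",
    "kir_Cyrl", "kjh", "kum", "ota_Arab", "ota_Latn", "sah", "tat",
    "tat_Arab", "tat_Latn", "tuk", "tuk_Latn", "tur", "tyv", "uig_Arab",
    "uig_Cyrl", "uzb_Cyrl", "uzb_Latn",
}


def tag_sentences(sentences, target_lang):
    """Prefix each sentence with a >>id<< target-language token.

    Assumption: the model selects its output language from this token.
    """
    if target_lang not in TARGET_LANGS:
        raise ValueError(f"unsupported target language: {target_lang}")
    return [f">>{target_lang}<< {s}" for s in sentences]


def translate(sentences, target_lang, model_name="Helsinki-NLP/opus-mt-en-trk"):
    """Translate English sentences; model_name is an assumed hub identifier."""
    # Imported here so the tagging helper works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(
        tag_sentences(sentences, target_lang),
        return_tensors="pt",
        padding=True,
    )
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


if __name__ == "__main__":
    # Downloads the model weights on first use.
    print(translate(["How are you today?"], "tur"))
```

Keeping the tagging step in its own function makes it easy to check inputs before loading the model, and to switch target languages per batch. Given the benchmark scores, output quality is likely to vary widely by target language.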
Documentation
Model Details
- Source Group: English
- Target Group: Turkic languages
- OPUS readme: [eng-trk](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/eng-trk/README.md)
- Model: Transformer
- Source Language(s): eng
- Target Language(s): aze_Latn, bak, chv, crh, crh_Latn, kaz_Cyrl, kaz_Latn, kir_Cyrl, kjh, kum, ota_Arab, ota_Latn, sah, tat, tat_Arab, tat_Latn, tuk, tuk_Latn, tur, tyv, uig_Arab, uig_Cyrl, uzb_Cyrl, uzb_Latn
- Pre-processing: normalization + SentencePiece (spm32k, spm32k)
System Info
| Property | Details |
|---|---|
| hf_name | eng-trk |
| source_languages | eng |
| target_languages | trk |
| opus_readme_url | https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/eng-trk/README.md |
| original_repo | Tatoeba-Challenge |
| tags | ['translation'] |
| languages | ['en', 'tt', 'cv', 'tk', 'tr', 'ba', 'trk'] |
| src_constituents | {'eng'} |
| tgt_constituents | {'kir_Cyrl', 'tat_Latn', 'tat', 'chv', 'uzb_Cyrl', 'kaz_Latn', 'aze_Latn', 'crh', 'kjh', 'uzb_Latn', 'ota_Arab', 'tuk_Latn', 'tuk', 'tat_Arab', 'sah', 'tyv', 'tur', 'uig_Arab', 'crh_Latn', 'kaz_Cyrl', 'uig_Cyrl', 'kum', 'ota_Latn', 'bak'} |
| src_multilingual | False |
| tgt_multilingual | True |
| prepro | normalization + SentencePiece (spm32k, spm32k) |
| url_model | https://object.pouta.csc.fi/Tatoeba-MT-models/eng-trk/opus2m-2020-08-01.zip |
| url_test_set | https://object.pouta.csc.fi/Tatoeba-MT-models/eng-trk/opus2m-2020-08-01.test.txt |
| src_alpha3 | eng |
| tgt_alpha3 | trk |
| short_pair | en-trk |
| chrF2_score | 0.455 |
| bleu | 19.9 |
| brevity_penalty | 1.0 |
| ref_len | 57072.0 |
| src_name | English |
| tgt_name | Turkic languages |
| train_date | 2020-08-01 |
| src_alpha2 | en |
| tgt_alpha2 | trk |
| prefer_old | False |
| long_pair | eng-trk |
| helsinki_git_sha | 480fcbe0ee1bf4774bcbe6226ad9f58e63f6c535 |
| transformers_git_sha | 2207e5d8cb224e954a7cba69fa4ac2309e9ff30b |
| port_machine | brutasse |
| port_time | 2020-08-21-14:41 |
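The brevity_penalty and ref_len entries above relate to how BLEU penalizes output that is shorter than the reference. The sketch below shows the standard BLEU brevity penalty formula for illustration; it is not taken from this project's evaluation scripts, the function name is hypothetical, and the hypothesis lengths are made up. Only the reference length 57072 comes from the table.

```python
import math


def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """Standard BLEU brevity penalty.

    Returns 1.0 when the hypothesis is at least as long as the reference,
    and exp(1 - ref_len / hyp_len) when it is shorter.
    """
    if hyp_len <= 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


REF_LEN = 57072  # ref_len from the System Info table

# Hypothetical system output lengths, for illustration only.
for hyp_len in (50000, 57072, 60000):
    print(hyp_len, round(brevity_penalty(hyp_len, REF_LEN), 3))
```

A reported penalty of 1.0 is consistent with the system output being at least as long as the reference in total, so the aggregate BLEU score of 19.9 was not reduced for length.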
Benchmarks
| testset | BLEU | chr-F |
|---|---|---|
| newsdev2016-entr-engtur.eng.tur | 10.1 | 0.437 |
| newstest2016-entr-engtur.eng.tur | 9.2 | 0.410 |
| newstest2017-entr-engtur.eng.tur | 9.0 | 0.410 |
| newstest2018-entr-engtur.eng.tur | 9.2 | 0.413 |
| Tatoeba-test.eng-aze.eng.aze | 26.8 | 0.577 |
| Tatoeba-test.eng-bak.eng.bak | 7.6 | 0.308 |
| Tatoeba-test.eng-chv.eng.chv | 4.3 | 0.270 |
| Tatoeba-test.eng-crh.eng.crh | 8.1 | 0.330 |
| Tatoeba-test.eng-kaz.eng.kaz | 11.1 | 0.359 |
| Tatoeba-test.eng-kir.eng.kir | 28.6 | 0.524 |
| Tatoeba-test.eng-kjh.eng.kjh | 1.0 | 0.041 |
| Tatoeba-test.eng-kum.eng.kum | 2.2 | 0.075 |
| Tatoeba-test.eng.multi | 19.9 | 0.455 |
| Tatoeba-test.eng-ota.eng.ota | 0.5 | 0.065 |
| Tatoeba-test.eng-sah.eng.sah | 0.7 | 0.030 |
| Tatoeba-test.eng-tat.eng.tat | 9.7 | 0.316 |
| Tatoeba-test.eng-tuk.eng.tuk | 5.9 | 0.317 |
| Tatoeba-test.eng-tur.eng.tur | 34.6 | 0.623 |
| Tatoeba-test.eng-tyv.eng.tyv | 5.4 | 0.210 |
| Tatoeba-test.eng-uig.eng.uig | 0.1 | 0.155 |
| Tatoeba-test.eng-uzb.eng.uzb | 3.4 | 0.275 |
License
This project is licensed under the Apache-2.0 license.