🚀 opus-mt-tc-big-en-fi
A neural machine translation model for translating text from English (en) to Finnish (fi), part of a broader initiative to make high-quality translation models accessible globally.
🚀 Quick Start
This is a neural machine translation model for translating from English to Finnish. It is part of the OPUS-MT project, which aims to make neural machine translation models accessible for many languages. The model was originally trained with Marian NMT and then converted to PyTorch using the transformers library by Hugging Face.
✨ Features
- Multilingual Support: A multilingual translation model with multiple target languages. A sentence-initial language token in the form of `>>id<<` (where `id` is a valid target language ID), e.g. `>>fin<<`, is required; see the sketch after this list.
- Based on OPUS: Training data is sourced from OPUS, and training pipelines follow the procedures of OPUS-MT-train.
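As a minimal illustration of the token format described above (whether a given checkpoint actually needs the token depends on how it was trained; the usage examples below pass plain English text):

```python
# Hypothetical batch of multilingual inputs: each sentence starts with
# the sentence-initial target-language token, e.g. >>fin<< for Finnish.
src_text = [">>fin<< Russia is big.", ">>fin<< Touch wood!"]
```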
📦 Installation
No specific installation steps are provided in the original document.
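As a minimal setup sketch (not from the original card): the usage examples below assume the Hugging Face `transformers` library plus the `sentencepiece` package that Marian tokenizers depend on, e.g. `pip install transformers sentencepiece`.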
💻 Usage Examples
Basic Usage
```python
from transformers import MarianMTModel, MarianTokenizer

src_text = [
    "Russia is big.",
    "Touch wood!"
]

# Local path from the original card; the published Hub ID is
# "Helsinki-NLP/opus-mt-tc-big-en-fi" (see Advanced Usage below).
model_name = "pytorch-models/opus-mt-tc-big-en-fi"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch, generate translations, and decode each output sequence.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))
```
Advanced Usage
```python
from transformers import pipeline

# The pipeline API downloads the model from the Hugging Face Hub by its ID.
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-en-fi")
print(pipe("Russia is big."))
```
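The pipeline returns a list with one dictionary per input, each carrying a `translation_text` key (e.g. `[{'translation_text': ...}]`) whose value is the Finnish translation.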
📚 Documentation
Model Info
Publications
Please cite the following publications if you use this model:
- [OPUS-MT – Building open translation services for the World](https://aclanthology.org/2020.eamt-1.61/)
- [The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT](https://aclanthology.org/2020.wmt-1.139/)
```bibtex
@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}
```
Benchmarks
| langpair | testset | chr-F | BLEU | #sent | #words |
|----------|---------|-------|------|-------|--------|
| eng-fin | tatoeba-test-v2021-08-07 | 0.64352 | 39.3 | 10690 | 65122 |
| eng-fin | flores101-devtest | 0.61334 | 27.6 | 1012 | 18781 |
| eng-fin | newsdev2015 | 0.58367 | 24.2 | 1500 | 23091 |
| eng-fin | newstest2015 | 0.60080 | 26.4 | 1370 | 19735 |
| eng-fin | newstest2016 | 0.61636 | 28.8 | 3000 | 47678 |
| eng-fin | newstest2017 | 0.64381 | 31.3 | 3002 | 45269 |
| eng-fin | newstest2018 | 0.55626 | 19.7 | 3000 | 44836 |
| eng-fin | newstest2019 | 0.58420 | 26.4 | 1997 | 38369 |
| eng-fin | newstestB2016 | 0.57554 | 23.3 | 3000 | 45766 |
| eng-fin | newstestB2017 | 0.60212 | 26.8 | 3002 | 45506 |
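The table reports chr-F and BLEU. As a rough, hypothetical sketch of how such scores can be computed for one's own system output (not part of the original card; the sentences are placeholders, and sacrebleu versions differ on whether chrF is reported on a 0-1 or 0-100 scale):

```python
import sacrebleu

# Hypothetical system outputs and reference translations (one sentence each).
hypotheses = ["Venäjä on iso."]
references = [["Venäjä on suuri."]]  # one reference stream

# Corpus-level BLEU and chrF, the two metrics used in the table above.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}, chrF = {chrf.score:.5f}")
```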
🔧 Technical Details
The model is trained using the Marian NMT framework, an efficient NMT implementation written in pure C++. Training data is sourced from OPUS, and the training pipelines follow the procedures of OPUS-MT-train. After training, the model is converted to PyTorch using the transformers library by Hugging Face.
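As an aside not stated in the original card: the transformers repository ships a Marian conversion script (`convert_marian_to_pytorch.py`), which handles this kind of Marian-to-PyTorch port; the Model conversion info section below records the transformers version and OPUS-MT commit used for this particular checkpoint.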
📄 License
This model is released under the CC-BY-4.0 license.
Acknowledgements
The development of this model is supported by multiple projects:
- European Language Grid as pilot project 2866.
- [FoTran project](https://www.helsinki.fi/en/researchgroups/natural-language-understanding-with-cross-lingual-grounding), funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 771113).
- MeMAD project, funded by the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No 780069.
We also appreciate the computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland.
Model conversion info
- Transformers Version: 4.16.2
- OPUS-MT Git Hash: f084bad
- Port Time: Tue Mar 22 14:42:32 EET 2022
- Port Machine: LM0-400-22516.local