opus-mt-tc-bible-big-roa-deu_eng_fra_por_spa: an open-source model for translating Romance languages into German, English, French, Portuguese, and Spanish

Developed by Helsinki-NLP

This is a multi-target neural machine translation model specifically designed for translating from multiple Romance languages into German, English, French, Portuguese, and Spanish.

Machine Translation

Transformers

Tags: Supports Multiple Languages; Multilingual Bible Translation; Romance Language Support; High Precision
License: Apache-2.0 (open source)
BLEU: 55.6

Release Time : 10/8/2024

Model Overview

This model is part of the OPUS-MT project, which aims to provide widely accessible neural machine translation models for a variety of global languages. It supports translation from Antillean Creole, Aragonese, and other Romance languages into German, English, French, Portuguese, and Spanish.

Model Features

Multi-Target Language Support

Supports translation from multiple Romance languages into German, English, French, Portuguese, and Spanish.

High-Performance Translation

Scores BLEU 55.6 and chr-F 0.73367 on the tatoeba-test-v2020-07-28-v2023-09-26 dataset, aggregated over all language pairs (multi-multi; 10,000 sentences).

Extensive Language Coverage

Supports 42 source language codes and 5 target languages, covering Romance languages and creoles.

Model Capabilities

Text Translation

Multilingual Support

Neural Machine Translation

Use Cases

Language Translation

Multilingual Document Translation

Translate documents from multiple Romance languages into German, English, French, Portuguese, or Spanish.

Quality varies by language pair; per-pair scores are listed on the OPUS-MT dashboard.

Cross-Language Communication

Help users communicate across languages by translating short messages.

Speed and accuracy depend on the hardware and on the source language.

🚀 opus-mt-tc-bible-big-roa-deu_eng_fra_por_spa

A neural machine translation model for translating from Romance languages to multiple target languages.

🚀 Quick Start

Here is a short code example to get you started with the model:

from transformers import MarianMTModel, MarianTokenizer

# Each input starts with a >>id<< token that selects the target language.
src_text = [
    ">>deu<< Replace this with text in an accepted source language.",
    ">>spa<< This is the second sentence."
]

# Model id on the Hugging Face Hub.
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-roa-deu_eng_fra_por_spa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-roa-deu_eng_fra_por_spa")
print(pipe(">>deu<< Replace this with text in an accepted source language."))
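For more than a handful of sentences, it may help to translate in fixed-size batches so that padding and memory use stay bounded. The following is a minimal sketch and not part of the original model card: `chunked` and `translate_all` are hypothetical helper names, the batch size of 8 is an arbitrary assumption, and the Hub id is the one used in the pipeline example above.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_all(
    sentences: Sequence[str],
    model_name: str = "Helsinki-NLP/opus-mt-tc-bible-big-roa-deu_eng_fra_por_spa",
    batch_size: int = 8,  # assumed value; tune for your hardware
) -> List[str]:
    """Translate sentences that already start with a >>id<< target token."""
    # Imported lazily so that `chunked` works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    outputs: List[str] = []
    for batch in chunked(sentences, batch_size):
        encoded = tokenizer(batch, return_tensors="pt", padding=True)
        generated = model.generate(**encoded)
        outputs.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return outputs
```

Loading the model once and reusing it across batches avoids repeating the download and initialization cost for every call.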

✨ Features

This is a multilingual translation model capable of translating from Romance languages (roa) to multiple target languages (deu, eng, fra, por, spa).
It is part of the OPUS-MT project, making neural machine translation models widely available.
All models are originally trained using the Marian NMT framework and converted to PyTorch using the Hugging Face transformers library.

📦 Installation

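The original model card gives no installation steps. The usage examples in this document need the Hugging Face transformers library with a PyTorch backend, and MarianTokenizer additionally requires the sentencepiece package.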
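Before running the usage examples, a quick check of the environment can save a failed import later. The sketch below is an illustration only: `missing_packages` is a hypothetical helper, and the package list (transformers, torch, sentencepiece) is an assumption based on what the Marian classes in transformers typically need, not something the model card specifies.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the subset of `names` that cannot be found as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed requirements for the examples: transformers, a PyTorch backend (torch),
# and sentencepiece, which MarianTokenizer relies on.
REQUIRED = ("transformers", "torch", "sentencepiece")

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```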

📚 Documentation

Model Details

Neural machine translation model for translating from Romance languages (roa) to German, English, French, Portuguese, and Spanish (deu+eng+fra+por+spa). This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using Marian NMT, an efficient NMT implementation written in pure C++, and have been converted to PyTorch using the Hugging Face transformers library. Training data is taken from OPUS, and training pipelines use the procedures of OPUS-MT-train.

Model Description:

Property	Details
Developed by	Language Technology Research Group at the University of Helsinki
Model Type	Translation (transformer-big)
Release	2024-05-30
License	Apache-2.0
Source Language(s)	acf arg ast cat cbk cos crs egl ext fra frm fro frp fur gcf glg hat ita kea lad lij lld lmo lou mfe mol mwl nap oci osp pap pcd pms por roh ron rup scn spa srd vec wln
Target Language(s)	deu eng fra por spa
Valid Target Language Labels	>>deu<< >>eng<< >>fra<< >>por<< >>spa<< >>xxx<<
Original Model	opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
Resources for more information	OPUS-MT dashboard; OPUS-MT-train GitHub Repo; More information about MarianNMT models in the transformers library; Tatoeba Translation Challenge; HPLT bilingual data v1 (as part of the Tatoeba Translation Challenge dataset); A massively parallel Bible corpus
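Because the model has several target languages, every input sentence needs one of the target labels from the table above as a prefix. A minimal sketch of how that prefixing could be validated in code follows; `with_target` and `fan_out` are hypothetical helper names, and the generic `>>xxx<<` entry from the table is deliberately left out of the accepted set.

```python
from typing import Iterable, List

# Target labels listed in the table above (the generic >>xxx<< entry is omitted here).
VALID_TARGETS = ("deu", "eng", "fra", "por", "spa")


def with_target(text: str, target: str) -> str:
    """Prefix `text` with the >>id<< token that selects the output language."""
    if target not in VALID_TARGETS:
        raise ValueError(f"unsupported target {target!r}; expected one of {VALID_TARGETS}")
    return f">>{target}<< {text.strip()}"


def fan_out(text: str, targets: Iterable[str] = VALID_TARGETS) -> List[str]:
    """Build one tagged input per target language for the same source sentence."""
    return [with_target(text, t) for t in targets]
```

For example, `with_target("Bonjour le monde.", "deu")` returns `>>deu<< Bonjour le monde.`, which can be passed to the tokenizer or the pipeline as-is.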

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

⚠️ Important Note

Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

Training

Data: opusTCv20230926max50+bt+jhubc (source)
Pre-processing: SentencePiece (spm32k,spm32k)
Model Type: transformer-big
Original MarianNMT Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
Training Scripts: GitHub Repo

Evaluation

Model scores at the OPUS-MT dashboard
test set translations: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.test.txt
test set scores: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.eval.txt
benchmark results: benchmark_results.txt
benchmark output: benchmark_translations.zip

langpair	testset	chr-F	BLEU	#sent	#words
multi-multi	tatoeba-test-v2020-07-28-v2023-09-26	0.73367	55.6	10000	83852
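The chr-F column is a character n-gram F-score. To give a feel for what it measures, here is a simplified sentence-level sketch in pure Python. It assumes n-gram orders 1 to 6, beta = 2, and whitespace removal, and `simple_chrf` is a hypothetical name. The reported figures were produced by the OPUS-MT evaluation pipeline, and this sketch is not expected to reproduce them.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-1 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / hyp_total)
        recalls.append(overlap / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Identical strings score 1.0 and strings with no shared characters score 0.0; anything in between reflects partial character-level overlap.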

Citation Information

Publications: "Democratizing neural machine translation with OPUS-MT"; "OPUS-MT – Building open translation services for the World"; "The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT". Please cite them if you use this model.

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC – IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

transformers version: 4.45.1
OPUS-MT git hash: 0882077
port time: Tue Oct 8 15:15:30 EEST 2024
port machine: LM0-400-22516.local

📄 License

This model is licensed under the Apache-2.0 license.
