Opus-mt-tc-bible-big-roa-en Open-source Translation Model - Free Translation from Romance Languages to English

Developed by Helsinki-NLP

This is a neural machine translation model for translating Romance (roa) languages into English (en), which is part of the OPUS-MT project.

Machine Translation

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Romance language translation #Bible corpus training #Multilingual support

Downloads 2,985

Release Time: 2024-10-08

Model Overview

This model translates text from multiple Romance languages into English. It is built on the Transformer architecture (transformer-big) and is intended for text translation tasks.

Model Features

Multilingual support

Supports translation from multiple Romance languages to English

High-quality translation

Trained on data from OPUS; scores BLEU 62.8 and chr-F 0.76737 on the tatoeba-test-v2020-07-28-v2023-09-26 test set

Easy integration

Can be easily integrated into applications through the Hugging Face Transformers library

Model Capabilities

Text translation

Multilingual processing

Use Cases

Language translation

Document translation

Translate documents in Romance languages into English

High-quality English translation results

Real-time translation

Used for translation services in real-time chats or meetings

Fast and accurate translation responses

🚀 opus-mt-tc-bible-big-roa-en

A neural machine translation model for translating from Romance languages (roa) to English (en)

🚀 Quick Start

Basic Usage

from transformers import MarianMTModel, MarianTokenizer

# Source sentences; one batch can mix Romance languages
# (Portuguese and Spanish here).
src_text = [
    "É caro demais.",
    "Estamos muertos."
]

# Hugging Face Hub ID of the model.
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-roa-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize with padding so both sentences fit in one tensor, then generate.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     It's too expensive.
#     We're dead.

Advanced Usage

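from transformers import pipeline

# The translation pipeline wraps tokenization, generation, and decoding.
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-roa-en")

# The pipeline returns a list of dicts; the text is under "translation_text".
print(pipe("É caro demais.")[0]["translation_text"])

# expected output: It's too expensive.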
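For longer inputs it is usually preferable to translate in fixed-size batches rather than pass every sentence to a single generate call. The following is a minimal sketch of that pattern, not part of the original model card. The names `chunked`, `translate_in_batches`, and `make_marian_translate_fn`, and the default batch size of 8, are hypothetical choices made for illustration.

```python
from typing import Callable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_batches(
    texts: List[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # assumed default; tune to available memory
) -> List[str]:
    """Translate `texts` batch by batch, preserving input order."""
    results: List[str] = []
    for batch in chunked(texts, batch_size):
        outputs = translate_fn(batch)
        if len(outputs) != len(batch):
            raise ValueError("translate_fn must return one output per input")
        results.extend(outputs)
    return results


def make_marian_translate_fn(
    model_name: str = "Helsinki-NLP/opus-mt-tc-bible-big-roa-en",
) -> Callable[[List[str]], List[str]]:
    """Build a translate_fn backed by the model (downloads weights on first use)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate(batch: List[str]) -> List[str]:
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**inputs)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return translate
```

A call such as `translate_in_batches(sentences, make_marian_translate_fn())` would then translate a list of sentences in order. Passing the translation step as a callable keeps the batching logic independent of the model, so it can be exercised with a stand-in function.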

✨ Features

Multilingual Support: This model supports translation from multiple Romance languages (acf, arg, ast, etc.) to English.
High Performance: Achieves a BLEU score of 62.8 and a chr-F score of 0.76737 on the multi-eng tatoeba-test-v2020-07-28-v2023-09-26 test set.
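To give a feel for what a chr-F score measures, here is a simplified, sentence-level sketch of a character n-gram F-score. It is an illustration only: it ignores spaces, averages precision and recall over n-gram orders 1 to 6, and uses beta = 2, all of which are assumptions. It is not the evaluation tool that produced the score reported above, and it should not be expected to reproduce that number.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of length `n`, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Return a chrF-style score in [0, 1] for a single sentence pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders for which either side has no n-grams
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

An exact match scores 1.0 and a hypothesis sharing no characters with the reference scores 0.0; near-matches such as "It is too expensive." against "It's too expensive." fall in between, which is why character-level metrics are more forgiving of small spelling or inflection differences than word-level ones.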

📚 Documentation

Model Details

Neural machine translation model for translating from Romance languages (roa) to English (en).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many of the world's languages. All models are originally trained with Marian NMT, an efficient NMT implementation written in pure C++, and have been converted to PyTorch using the Hugging Face transformers library. Training data is taken from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Property	Details
Developed by	Language Technology Research Group at the University of Helsinki
Model Type	Translation (transformer-big)
Release	2024-08-17
License	Apache-2.0
Source Language(s)	acf arg ast cat cbk cos egl ext fra frm frp fur gcf glg hat ita kea lad lij lld lmo lou mfe mol mwl nap oci osp pap pms por roh ron rup scn spa srd vec wln
Target Language(s)	eng
Original Model	opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip
Resources for more information	OPUS-MT dashboard, OPUS-MT-train GitHub Repo, More information about MarianNMT models in the transformers library, Tatoeba Translation Challenge, HPLT bilingual data v1 (as part of the Tatoeba Translation Challenge dataset), A massively parallel Bible corpus
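Because the model only covers the source languages listed in the table, an application may want to check a language code before sending text to the model. The sketch below copies the ISO 639-3 codes from the table; the names `SOURCE_LANGUAGES`, `TARGET_LANGUAGE`, and `is_supported_pair` are hypothetical helpers, not part of the model or the transformers library.

```python
# ISO 639-3 source language codes, copied from the model details table.
SOURCE_LANGUAGES = frozenset(
    "acf arg ast cat cbk cos egl ext fra frm frp fur gcf glg hat ita kea lad "
    "lij lld lmo lou mfe mol mwl nap oci osp pap pms por roh ron rup scn spa "
    "srd vec wln".split()
)
TARGET_LANGUAGE = "eng"


def is_supported_pair(source: str, target: str = TARGET_LANGUAGE) -> bool:
    """Return True if the model card lists `source` -> `target` as supported."""
    return source.strip().lower() in SOURCE_LANGUAGES and target.strip().lower() == TARGET_LANGUAGE


if __name__ == "__main__":
    for code in ("por", "spa", "deu"):
        print(code, is_supported_pair(code))
```

The check expects three-letter codes as used in the table; two-letter codes such as "pt" would need to be mapped to their three-letter equivalents first.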

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

⚠️ Important Note

Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

Training

Data: opusTCv20230926max50+bt+jhubc (source)
Pre-processing: SentencePiece (spm32k,spm32k)
Model Type: transformer-big
Original MarianNMT Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.zip
Training Scripts: GitHub Repo

Evaluation

Model scores at the OPUS-MT dashboard
test set translations: opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.test.txt
test set scores: opusTCv20230926max50+bt+jhubc_transformer-big_2024-08-17.eval.txt
benchmark results: benchmark_results.txt
benchmark output: benchmark_translations.zip

langpair	testset	chr-F	BLEU	#sent	#words
multi-eng	tatoeba-test-v2020-07-28-v2023-09-26	0.76737	62.8	10000	87576
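The results table is tab-separated, so it can be loaded into typed records with the standard library. This is a small sketch assuming the column names shown above; `RESULTS_TSV` simply restates the row from the table, and `parse_results` is a hypothetical helper name.

```python
import csv
import io

# The header and row from the evaluation table, restated as tab-separated text.
RESULTS_TSV = (
    "langpair\ttestset\tchr-F\tBLEU\t#sent\t#words\n"
    "multi-eng\ttatoeba-test-v2020-07-28-v2023-09-26\t0.76737\t62.8\t10000\t87576\n"
)


def parse_results(tsv_text: str) -> list:
    """Parse benchmark rows into dicts with numeric fields converted."""
    rows = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        rows.append({
            "langpair": row["langpair"],
            "testset": row["testset"],
            "chrf": float(row["chr-F"]),
            "bleu": float(row["BLEU"]),
            "sentences": int(row["#sent"]),
            "words": int(row["#words"]),
        })
    return rows


if __name__ == "__main__":
    for result in parse_results(RESULTS_TSV):
        avg_len = result["words"] / result["sentences"]
        print(result["langpair"], result["bleu"], result["chrf"], round(avg_len, 1))
```

Dividing #words by #sent gives an average of about 8.8 words per test sentence, which is useful context when comparing these scores with benchmarks built from longer sentences.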

Citation Information

Publications: "Democratizing neural machine translation with OPUS-MT", "OPUS-MT – Building open translation services for the World", and "The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT". Please cite these if you use this model.

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

transformers version: 4.45.1
OPUS-MT git hash: 0882077
port time: Tue Oct 8 15:26:36 EEST 2024
port machine: LM0-400-22516.local

📄 License

This model is licensed under the Apache-2.0 license.
