Opus-mt-tc-big-en-pt Open-source Translation Model - Free and Precise English-to-Portuguese Translation

Opus Mt Tc Big En Pt

Developed by Helsinki-NLP

This is a neural machine translation model for English to Portuguese (including Brazilian Portuguese), part of the OPUS-MT project.

Machine Translation

Transformers

Supports Multiple Languages #English-to-Portuguese translation #Multi-dialect support #High BLEU score

Downloads 65.51k

Release Time : 4/13/2022 (date of the Transformers port; the original Marian model was released on 2022-03-13)

Model Overview

This model is specifically designed for translating English text into Portuguese, supporting both Brazilian and European Portuguese variants. It is based on the transformer-big architecture and uses SentencePiece for tokenization.

Model Features

Multi-target language support

Supports translation to both Brazilian and European Portuguese by prepending a target-language tag (>>pob<< or >>por<<) to each input sentence.

High-performance translation

Achieves BLEU scores of 50.4 and 49.6 on the flores101-devtest and tatoeba-test-v2021-08-07 test sets, respectively.

Open-source license

Uses the cc-by-4.0 license, allowing for both commercial and research use.

Model Capabilities

English to Portuguese text translation

Supports Brazilian and European Portuguese variants

Use Cases

Content localization

Website content translation

Translate English website content into Portuguese for the Brazilian or Portuguese markets.

Reported BLEU score of 50.4 on the flores101-devtest benchmark

Document translation

Business document translation

Translate English business contracts or reports into Portuguese.

Intended to preserve professional terminology; the reported benchmarks (tatoeba and flores101) do not measure domain-specific accuracy

🚀 opus-mt-tc-big-en-pt

A neural machine translation model designed to translate from English (en) to Portuguese (pt). This project aims to make neural machine translation models accessible for various languages.

🚀 Quick Start

This model is part of the OPUS-MT project, an initiative to make neural machine translation models widely available for many of the world's languages. All models are first trained with the Marian NMT framework, an efficient NMT implementation written in pure C++, and then converted to PyTorch using the transformers library by Hugging Face. Training data comes from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Publications: OPUS-MT – Building open translation services for the World and The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT (please cite both if you use this model).

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

✨ Features

Multiple Target Languages: The model translates from English into more than one target variety (pob and por).
Initial Language Token: Each input sentence must begin with a language token of the form >>id<<, where id is a valid target language ID, e.g., >>pob<<.
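The token requirement above can be handled with a small helper. The following is a minimal sketch, not part of the model card: the function name add_target_token and the constant VALID_TARGET_LABELS are hypothetical, and the only facts taken from the card are the two valid labels (pob, por) and the >>id<< format.

```python
# Hypothetical helper: names are illustrative, not part of transformers or OPUS-MT.
VALID_TARGET_LABELS = {"pob", "por"}  # labels listed in the model card


def add_target_token(sentences, target="por"):
    """Return the sentences with a >>id<< target-language token prepended."""
    if target not in VALID_TARGET_LABELS:
        raise ValueError(f"unsupported target label: {target!r}")
    prefix = f">>{target}<< "
    # Strip surrounding whitespace so the token is followed by exactly one space.
    return [prefix + sentence.strip() for sentence in sentences]


print(add_target_token(["Tom tried to stab me."], target="pob"))
```

Rejecting unknown labels up front is a design choice for this sketch; it surfaces typos such as >>pt<< before any text reaches the model.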

📦 Installation

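The original model card lists no specific installation steps. The usage examples below need the transformers library with a PyTorch backend, and MarianTokenizer additionally depends on the sentencepiece package (for example: pip install transformers sentencepiece torch).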

💻 Usage Examples

Basic Usage

from transformers import MarianMTModel, MarianTokenizer

# Each source sentence starts with a target-language token (>>por<< or >>pob<<).
src_text = [
    ">>por<< Tom tried to stab me.",
    ">>por<< He has been to Hawaii several times."
]

# Load the tokenizer and model from the Hugging Face Hub.
model_name = "Helsinki-NLP/opus-mt-tc-big-en-pt"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch (padded to a common length) and generate translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     O Tom tentou esfaquear-me.
#     Ele já esteve no Havaí várias vezes.
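The basic example passes every sentence to a single generate call. For longer lists it is common to translate in fixed-size batches so that padded tensors stay small. The sketch below shows one way to do that; chunked, translate_in_batches, and the translate_fn parameter are hypothetical names, and the demo uses a stand-in function so that it runs without downloading the model.

```python
from typing import Callable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_batches(
    sentences: List[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Apply `translate_fn` batch by batch and return results in input order."""
    results: List[str] = []
    for batch in chunked(sentences, batch_size):
        results.extend(translate_fn(batch))
    return results


# With the real model, translate_fn could wrap the calls from the basic example
# (assumes `tokenizer` and `model` are already loaded):
#
# def translate_fn(batch):
#     out = model.generate(**tokenizer(batch, return_tensors="pt", padding=True))
#     return tokenizer.batch_decode(out, skip_special_tokens=True)

# Stand-in that only upper-cases its input, to show the control flow.
demo = translate_in_batches(["a", "b", "c"], lambda b: [s.upper() for s in b], batch_size=2)
print(demo)
```

The batch size of 8 is an arbitrary default for illustration; a suitable value depends on sentence length and available memory.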

Advanced Usage

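from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-en-pt")
# The pipeline returns a list with one dict per input; the text is under "translation_text".
print(pipe(">>por<< Tom tried to stab me.")[0]["translation_text"])

# expected output: O Tom tentou esfaquear-me.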

📚 Documentation

Model Info

Property	Details
Release	2022-03-13
Source Language(s)	eng
Target Language(s)	pob por
Valid Target Language Labels	>>pob<< >>por<<
Model Type	transformer-big
Training Data	opusTCv20210807+bt (source)
Tokenization	SentencePiece (spm32k,spm32k)
Original Model	opusTCv20210807+bt_transformer-big_2022-03-13.zip
More Info on Released Models	OPUS-MT eng-por README
More Info about the Model	MarianMT

Benchmarks

Test set translations: opusTCv20210807+bt_transformer-big_2022-03-13.test.txt
Test set scores: opusTCv20210807+bt_transformer-big_2022-03-13.eval.txt
Benchmark results: benchmark_results.txt
Benchmark output: benchmark_translations.zip

langpair	testset	chr-F	BLEU	#sent	#words
eng-por	tatoeba-test-v2021-08-07	0.69320	49.6	13222	105265
eng-por	flores101-devtest	0.71673	50.4	1012	26519
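The two test sets differ considerably in size and sentence length, which is worth keeping in mind when comparing their scores. As a rough illustration, the sketch below parses the table rows above as tab-separated text and derives the average number of words per sentence from the #sent and #words columns. The function name words_per_sentence is hypothetical; the figures are copied from the table.

```python
import csv
import io

# Benchmark rows copied from the table above (tab-separated).
BENCHMARKS_TSV = (
    "langpair\ttestset\tchr-F\tBLEU\t#sent\t#words\n"
    "eng-por\ttatoeba-test-v2021-08-07\t0.69320\t49.6\t13222\t105265\n"
    "eng-por\tflores101-devtest\t0.71673\t50.4\t1012\t26519\n"
)


def words_per_sentence(tsv_text):
    """Map each test set name to its average words per sentence."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["testset"]: int(row["#words"]) / int(row["#sent"]) for row in reader}


for testset, average in words_per_sentence(BENCHMARKS_TSV).items():
    print(f"{testset}: {average:.1f} words per sentence")
```

By this count the tatoeba sentences average about 8 words and the flores101 sentences about 26, so the two BLEU figures describe quite different kinds of input.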

Acknowledgements

The work is supported by the European Language Grid as pilot project 2866, by the FoTran project, funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 771113), and the MeMAD project, funded by the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No 780069. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland.

Model Conversion Info

Property	Details
Transformers Version	4.16.2
OPUS-MT Git Hash	3405783
Port Time	Wed Apr 13 17:48:54 EEST 2022
Port Machine	LM0-400-22516.local

📄 License

This model is released under the cc-by-4.0 license.
