opus-mt-tc-big-en-ar Open-Source Translation Model - Free Multi-Target Translation from English to Arabic

Developed by Helsinki-NLP

A neural machine translation model for English-to-Arabic translation, released as part of the OPUS-MT project. It is a multi-target model: the output variant is selected with a target-language token at the start of each source sentence.

Release Time: 4/13/2022 (the date of the Transformers port; the original model release is dated 2022-02-25)

Model Overview

This model translates English into Arabic. It uses the transformer-big architecture, is trained on data from the OPUS corpus, and supports two target variants: Standard Arabic (ara) and Gulf Arabic (afb).

Model Features

Multi-target Language Support

Supports both Standard Arabic and Gulf Arabic. The target variant is selected by prefixing each source sentence with a target-language token, >>ara<< or >>afb<<.

High-quality Translation

Achieves a BLEU score of 29.4 and a chr-F score of 0.61154 on the flores101-devtest set.

Based on OPUS Corpus

Training data comes from the extensive OPUS multilingual corpus, covering various domains and contexts.

Model Capabilities

English-to-Arabic text translation

Supports Standard Arabic and Gulf Arabic variants

Batch text processing

Use Cases

Content Localization

Website Content Translation

Translates English website content into Arabic, supporting multiple regional variants.

Reported benchmark: BLEU 29.4 on flores101-devtest

Business Communication

Business Document Translation

Translates formal documents such as business letters and contracts.

Reported benchmark: BLEU 30.0 on tico19-test

🚀 opus-mt-tc-big-en-ar

A neural machine translation model designed to translate from English (en) to Arabic (ar), contributing to the widespread availability and accessibility of translation technology.

🚀 Quick Start

This model is part of the OPUS-MT project, which aims to make neural machine translation models widely available and accessible for many languages. The model was originally trained with the Marian NMT framework, an efficient NMT implementation written in pure C++, and then converted to PyTorch using the Hugging Face transformers library. The training data is taken from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Publications: "OPUS-MT – Building open translation services for the World" and "The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT" (please cite them if you use this model).

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

✨ Features

Multilingual Support: This is a multilingual translation model with multiple target languages. Each source sentence must start with a language token of the form >>id<<, where id is a valid target language ID (for example, >>afb<<).
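The token requirement can be handled with a small helper. The following is a minimal sketch, not part of the model card: the names `VALID_TARGETS` and `add_target_token` are hypothetical, and the set of IDs is copied from the "Valid Target Language Labels" entry of the model info table.

```python
from typing import Iterable, List

# Target-language IDs listed on this model card: afb (Gulf Arabic), ara (Arabic).
# The constant name is illustrative, not part of any library.
VALID_TARGETS = {"afb", "ara"}


def add_target_token(sentences: Iterable[str], target: str = "ara") -> List[str]:
    """Prefix each sentence with the >>id<< token the model expects."""
    if target not in VALID_TARGETS:
        raise ValueError(
            f"unknown target {target!r}; expected one of {sorted(VALID_TARGETS)}"
        )
    prefix = f">>{target}<< "
    # Strip surrounding whitespace so the token is followed by exactly one space.
    return [prefix + sentence.strip() for sentence in sentences]
```

For example, `add_target_token(["I have to write a letter."], "afb")` returns `[">>afb<< I have to write a letter."]`, which can be passed to the tokenizer as source text. Rejecting unknown IDs early is a design choice of this sketch; how the model itself behaves on an unlisted token is not stated in the card.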

📦 Installation

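The model card lists no model-specific installation steps. The usage examples below need the Hugging Face transformers library with a PyTorch backend, and the Marian tokenizer additionally depends on the sentencepiece package.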

💻 Usage Examples

Basic Usage

from transformers import MarianMTModel, MarianTokenizer

# Each source sentence starts with the target-language token (>>ara<< or >>afb<<).
src_text = [
    ">>ara<< I can't help you because I'm busy.",
    ">>ara<< I have to write a letter. Do you have some paper?"
]

# Hugging Face Hub ID of the converted model (the same ID the pipeline example uses).
model_name = "Helsinki-NLP/opus-mt-tc-big-en-ar"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the whole batch with padding, then generate the translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     لا أستطيع مساعدتك لأنني مشغول.
#     يجب أن أكتب رسالة هل لديك بعض الأوراق؟

Advanced Usage

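from transformers import pipeline

# The pipeline wraps tokenization, generation, and decoding in a single call.
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-en-ar")

# The pipeline returns a list of dicts; the text is under "translation_text".
result = pipe(">>ara<< I can't help you because I'm busy.")
print(result[0]["translation_text"])

# expected output: لا أستطيع مساعدتك لأنني مشغول.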
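For longer inputs, the batch processing mentioned under Model Capabilities can be sketched as below. This is an illustration under stated assumptions rather than the project's own code: `batched`, `translate_all`, and `make_marian_translator` are hypothetical helper names, and the default batch size of 8 is arbitrary. Only the `MarianTokenizer` and `MarianMTModel` calls come from the transformers library.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_all(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Translate `sentences` batch by batch, preserving the input order."""
    results: List[str] = []
    for batch in batched(sentences, batch_size):
        results.extend(translate_batch(batch))
    return results


def make_marian_translator(
    model_name: str = "Helsinki-NLP/opus-mt-tc-big-en-ar",
) -> Callable[[List[str]], List[str]]:
    """Load the model once and return a function that translates one batch."""
    # Imported lazily so the batching helpers work without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate_batch(batch: List[str]) -> List[str]:
        # Sentences are expected to carry their >>id<< token already.
        inputs = tokenizer(batch, return_tensors="pt", padding=True)
        outputs = model.generate(**inputs)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return translate_batch
```

A call such as `translate_all(sentences, make_marian_translator(), batch_size=16)` would then translate a list of token-prefixed sentences in chunks of 16. Passing the translator as a callable keeps the batching logic independent of the model, so it can be exercised with a stand-in function before the weights are downloaded.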

📚 Documentation

Model info

Property	Details
Release	2022-02-25
Source Language(s)	eng
Target Language(s)	afb ara
Valid Target Language Labels	>>afb<< >>ara<<
Model Type	transformer-big
Training Data	opusTCv20210807+bt (source)
Tokenization	SentencePiece (spm32k,spm32k)
Original Model	opusTCv20210807+bt_transformer-big_2022-02-25.zip
More Info on Released Models	OPUS-MT eng-ara README
More Info about the Model	MarianMT

Benchmarks

Test set translations: opusTCv20210807+bt_transformer-big_2022-02-25.test.txt
Test set scores: opusTCv20210807+bt_transformer-big_2022-02-25.eval.txt
Benchmark results: benchmark_results.txt
Benchmark output: benchmark_translations.zip

Language pair	Test set	chr-F	BLEU	#sent	#words
eng-ara	tatoeba-test-v2021-08-07	0.48813	19.8	10305	61356
eng-ara	flores101-devtest	0.61154	29.4	1012	21357
eng-ara	tico19-test	0.60075	30.0	2100	51339
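To give a rough sense of what the chr-F column measures, here is a simplified, sentence-level sketch of a character n-gram F-score. It is not the scorer used to produce the numbers above, and its values will not match an official implementation exactly; the function names, the maximum n-gram order of 6, and beta = 2 are assumptions chosen to mirror common chrF defaults.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score in the range 0..1 (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```

Identical strings score 1.0 and strings with no characters in common score 0.0. Because the metric works on characters rather than words, it gives partial credit for near-matching word forms.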

Acknowledgements

The work is supported by the European Language Grid as pilot project 2866, by the FoTran project, funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 771113), and the MeMAD project, funded by the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No 780069. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland.

Model conversion info

Property	Details
Transformers Version	4.16.2
OPUS-MT Git Hash	3405783
Port Time	Wed Apr 13 16:37:31 EEST 2022
Port Machine	LM0-400-22516.local

📄 License

The model is licensed under cc-by-4.0.
