Opus-mt-tc-big-he-en Open-source Translation Model - Supports Precise Translation from Hebrew to English

opus-mt-tc-big-he-en

Developed by Helsinki-NLP

This is a neural machine translation model for translating from Hebrew to English, part of the OPUS-MT project, utilizing the transformer-big architecture.

Machine Translation

Transformers

Release Time : 4/13/2022

Model Overview

This model is specifically designed for Hebrew-to-English translation tasks, developed under the OPUS-MT project, trained using the Marian NMT framework, and converted to PyTorch format via the transformers library.

Model Features

High-Quality Translation

Achieves a BLEU score of 44.1 on the flores101-devtest dataset and 53.8 on the tatoeba-test-v2021-08-07 dataset.

Translation Direction

Translates in one direction only, from Hebrew to English.

Open-Source License

Released under the cc-by-4.0 license, which permits both commercial and research use provided attribution is given.

Model Capabilities

Text Translation

Hebrew-to-English Translation

Use Cases

Language Services

Document Translation

Translate Hebrew documents into English.

Reported BLEU score of 53.8 on the tatoeba-test-v2021-08-07 test set.

Real-Time Translation Service

Integrate into chat applications or websites to provide real-time translation functionality.
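In a chat or website setting the same short phrases tend to recur, so one option is to cache translations and skip the model call for repeated inputs. The following is a minimal sketch under that assumption; `make_cached_translator` and `fake_translate` are hypothetical names introduced here for illustration, and `translate` stands for any function that maps a Hebrew string to an English string.

```python
from functools import lru_cache
from typing import Callable


def make_cached_translator(
    translate: Callable[[str], str], maxsize: int = 1024
) -> Callable[[str], str]:
    """Wrap a translate function so that repeated inputs are served from a cache."""

    @lru_cache(maxsize=maxsize)
    def cached(text: str) -> str:
        return translate(text)

    return cached


if __name__ == "__main__":
    calls = []

    # Stand-in for a real model call (hypothetical, for demonstration only).
    def fake_translate(text: str) -> str:
        calls.append(text)
        return text.upper()

    translator = make_cached_translator(fake_translate)
    translator("shalom")
    translator("shalom")  # second call is answered from the cache
    print(len(calls))
```

The cache is keyed on the exact input string, so inputs that differ only in whitespace or punctuation are translated separately.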

Education

Language Learning Assistance

Help students understand Hebrew materials.

🚀 opus-mt-tc-big-he-en

A neural machine translation model that translates text from Hebrew (he) to English (en). It is part of the OPUS-MT project, which aims to make NMT models accessible for many languages.

🚀 Quick Start

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are first trained with Marian NMT, an efficient NMT implementation written in pure C++, and then converted to PyTorch using the transformers library by Hugging Face. The training data comes from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Publications: OPUS-MT – Building open translation services for the World and The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT (Please cite these papers if you use this model.)

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

✨ Features

Translation Direction: Translates from Hebrew to English.
Training: Trained with the Marian NMT framework and converted to PyTorch so that it can be used through the transformers library.
Data Source: Trained on data from OPUS.

📦 Installation

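The original model card gives no installation steps. The usage examples below require the transformers library with a PyTorch backend; MarianTokenizer additionally depends on the sentencepiece package.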

💻 Usage Examples

Basic Usage

from transformers import MarianMTModel, MarianTokenizer

src_text = [
    "היא שכחה לכתוב לו.",
    "אני רוצה לדעת מיד כשמשהו יקרה."
]

model_name = "Helsinki-NLP/opus-mt-tc-big-he-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     She forgot to write to him.
#     I want to know as soon as something happens.
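For longer inputs such as whole documents, passing every sentence to `generate` in one call can exhaust memory. A minimal sketch of batch-wise translation follows, assuming the document has already been split into sentences; `chunked`, `translate_all`, and the default `batch_size` of 8 are illustrative choices made here, not part of the original model card.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_all(
    sentences: Sequence[str],
    batch_size: int = 8,
    model_name: str = "Helsinki-NLP/opus-mt-tc-big-he-en",
) -> List[str]:
    """Translate sentences batch by batch to bound peak memory use."""
    # Imported lazily so that chunked() works without transformers installed.
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    model.eval()

    results: List[str] = []
    for batch in chunked(sentences, batch_size):
        # truncation=True cuts inputs that exceed the model's maximum length.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            output_ids = model.generate(**inputs)
        results.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return results
```

Calling `translate_all(src_text)` with the list from the example above would return the translations in input order. Truncated inputs lose their tail, so very long sentences are better split before translation.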

Advanced Usage

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-he-en")
print(pipe("היא שכחה לכתוב לו."))

# expected output: [{'translation_text': 'She forgot to write to him.'}]

📚 Documentation

Model info

Property	Details
Release	2022-03-13
Source Language(s)	heb
Target Language(s)	eng
Model	transformer-big
Data	opusTCv20210807+bt (source)
Tokenization	SentencePiece (spm32k,spm32k)
Original Model	opusTCv20210807+bt_transformer-big_2022-03-13.zip
More Information	OPUS-MT heb-eng README

Benchmarks

Property	Details
Test Set Translations	opusTCv20210807+bt_transformer-big_2022-03-13.test.txt
Test Set Scores	opusTCv20210807+bt_transformer-big_2022-03-13.eval.txt
Benchmark Results	benchmark_results.txt
Benchmark Output	benchmark_translations.zip

langpair	testset	chr-F	BLEU	#sent	#words
heb-eng	tatoeba-test-v2021-08-07	0.68565	53.8	10519	77427
heb-eng	flores101-devtest	0.68116	44.1	1012	24721
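The chr-F column is a character n-gram F-score on a 0 to 1 scale. To give an idea of what it measures, here is a simplified sentence-level sketch: it averages character n-gram precision and recall over orders 1 to 6 and combines them with an F-beta score (beta = 2, weighting recall more heavily). The function names are introduced here for illustration. This is not the evaluation code behind the table, whose scores are computed over whole test sets, so it will not reproduce those numbers.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF in the range [0, 1]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the texts is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)


if __name__ == "__main__":
    print(simple_chrf("She forgot to write him.", "She forgot to write to him."))
```

Because the score works on characters rather than words, a hypothesis with a small wording difference still receives partial credit, which is one reason it is reported alongside BLEU.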

Model conversion info

Property	Details
Transformers Version	4.16.2
OPUS-MT Git Hash	3405783
Port Time	Wed Apr 13 19:27:12 EEST 2022
Port Machine	LM0-400-22516.local

🔧 Technical Details

The model uses the transformer-big architecture and SentencePiece tokenization (spm32k for both source and target). It is trained on data from OPUS (opusTCv20210807+bt). The original Marian NMT model is converted to PyTorch with the transformers library so that it can be loaded through the standard transformers API.

📄 License

The model is released under the cc-by-4.0 license.
