Opus-mt-en-iir Open-source Multilingual Translation Model - Free Deployment, Supports English to Indo-Iranian Languages Translation

Opus Mt En Iir

Developed by Helsinki-NLP

This is a multilingual machine translation model based on the Transformer architecture, supporting translation tasks from English to various Indo-Iranian languages.

Machine Translation

Transformers

Supports Multiple LanguagesOpen Source License:Apache-2.0 #Multilingual Indo-Iranian translation #Low-resource language support #SentencePiece tokenization

Downloads 135

Release Time : 3/2/2022

Model Overview

This model is specifically designed for translations from English to Indo-Iranian languages, supporting multiple languages including Hindi, Bengali, Urdu, and more.

Model Features

Multilingual Support

Supports translation from English to 28 Indo-Iranian languages

Standardized Preprocessing

Uses standardization and SentencePiece tokenization (spm32k) for text preprocessing

Target Language Identification

Implements multilingual translation by adding target language identifiers at the beginning of sentences

Model Capabilities

Translation from English to multiple Indo-Iranian languages

Multilingual parallel translation

Standardized text processing

Use Cases

Cross-language Communication

English to Hindi Translation

Translate English content into Hindi

BLEU 17.0, chr-F 0.461

English to Bengali Translation

Translate English content into Bengali

BLEU 15.3, chr-F 0.459

Content Localization

English to Urdu Content Localization

Localize English content into Urdu

BLEU 12.3, chr-F 0.409

🚀 eng-iir Translation Model

This project focuses on the translation from English to Indo - Iranian languages. It provides a transformer - based model with specific pre - processing steps and offers detailed benchmarks and system information.

🚀 Quick Start

The model is of the transformer type.
It requires pre - processing with normalization and SentencePiece (spm32k,spm32k).
A sentence initial language token in the form of >>id<< (where id is a valid target language ID) is necessary.

✨ Features

Multi - target Translation: Capable of translating English into multiple Indo - Iranian languages, including asm, awa, ben, etc.
Benchmarked Performance: Provides BLEU and chr - F scores on various test sets to evaluate the model's performance.

📦 Installation

No specific installation steps are provided in the original document.

💻 Usage Examples

No code examples are provided in the original document.

📚 Documentation

eng - iir Model Details

Source Group: English
Target Group: Indo - Iranian languages
OPUS Readme: eng - iir
Model: transformer
Source Language(s): eng
Target Language(s): asm awa ben bho gom guj hif_Latn hin jdt_Cyrl kur_Arab kur_Latn mai mar npi ori oss pan_Guru pes pes_Latn pes_Thaa pnb pus rom san_Deva sin snd_Arab tgk_Cyrl tly_Latn urd zza
Pre - processing: normalization + SentencePiece (spm32k,spm32k)
Download Original Weights: opus2m - 2020 - 08 - 01.zip
Test Set Translations: opus2m - 2020 - 08 - 01.test.txt
Test Set Scores: opus2m - 2020 - 08 - 01.eval.txt

Benchmarks

Testset	BLEU	chr - F
newsdev2014 - enghin.eng.hin	6.7	0.326
newsdev2019 - engu - engguj.eng.guj	6.0	0.283
newstest2014 - hien - enghin.eng.hin	10.4	0.353
newstest2019 - engu - engguj.eng.guj	6.6	0.282
Tatoeba - test.eng - asm.eng.asm	2.7	0.249
Tatoeba - test.eng - awa.eng.awa	0.4	0.122
Tatoeba - test.eng - ben.eng.ben	15.3	0.459
Tatoeba - test.eng - bho.eng.bho	3.7	0.161
Tatoeba - test.eng - fas.eng.fas	3.4	0.227
Tatoeba - test.eng - guj.eng.guj	18.5	0.365
Tatoeba - test.eng - hif.eng.hif	1.0	0.064
Tatoeba - test.eng - hin.eng.hin	17.0	0.461
Tatoeba - test.eng - jdt.eng.jdt	3.9	0.122
Tatoeba - test.eng - kok.eng.kok	5.5	0.059
Tatoeba - test.eng - kur.eng.kur	4.0	0.125
Tatoeba - test.eng - lah.eng.lah	0.3	0.008
Tatoeba - test.eng - mai.eng.mai	9.3	0.445
Tatoeba - test.eng - mar.eng.mar	20.7	0.473
Tatoeba - test.eng.multi	13.7	0.392
Tatoeba - test.eng - nep.eng.nep	0.6	0.060
Tatoeba - test.eng - ori.eng.ori	2.4	0.193
Tatoeba - test.eng - oss.eng.oss	2.1	0.174
Tatoeba - test.eng - pan.eng.pan	9.7	0.355
Tatoeba - test.eng - pus.eng.pus	1.0	0.126
Tatoeba - test.eng - rom.eng.rom	1.3	0.230
Tatoeba - test.eng - san.eng.san	1.3	0.101
Tatoeba - test.eng - sin.eng.sin	11.7	0.384
Tatoeba - test.eng - snd.eng.snd	2.8	0.180
Tatoeba - test.eng - tgk.eng.tgk	8.1	0.353
Tatoeba - test.eng - tly.eng.tly	0.5	0.015
Tatoeba - test.eng - urd.eng.urd	12.3	0.409
Tatoeba - test.eng - zza.eng.zza	0.5	0.025

System Info

Property	Details
hf_name	eng - iir
source_languages	eng
target_languages	iir
opus_readme_url	https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/eng-iir/README.md
original_repo	Tatoeba - Challenge
tags	['translation']
languages	['en', 'bn', 'or', 'gu', 'mr', 'ur', 'hi', 'ps', 'os', 'as', 'si', 'iir']
src_constituents	{'eng'}
tgt_constituents	{'pnb', 'gom', 'ben', 'hif_Latn', 'ori', 'guj', 'pan_Guru', 'snd_Arab', 'npi', 'mar', 'urd', 'pes', 'bho', 'kur_Arab', 'tgk_Cyrl', 'hin', 'kur_Latn', 'pes_Thaa', 'pus', 'san_Deva', 'oss', 'tly_Latn', 'jdt_Cyrl', 'asm', 'zza', 'rom', 'mai', 'pes_Latn', 'awa', 'sin'}
src_multilingual	False
tgt_multilingual	True
prepro	normalization + SentencePiece (spm32k,spm32k)
url_model	https://object.pouta.csc.fi/Tatoeba-MT-models/eng-iir/opus2m-2020-08-01.zip
url_test_set	https://object.pouta.csc.fi/Tatoeba-MT-models/eng-iir/opus2m-2020-08-01.test.txt
src_alpha3	eng
tgt_alpha3	iir
short_pair	en - iir
chrF2_score	0.392
bleu	13.7
brevity_penalty	1.0
ref_len	63351.0
src_name	English
tgt_name	Indo - Iranian languages
train_date	2020 - 08 - 01
src_alpha2	en
tgt_alpha2	iir
prefer_old	False
long_pair	eng - iir
helsinki_git_sha	480fcbe0ee1bf4774bcbe6226ad9f58e63f6c535
transformers_git_sha	2207e5d8cb224e954a7cba69fa4ac2309e9ff30b
port_machine	brutasse
port_time	2020 - 08 - 21 - 14:41

📄 License

The project is licensed under the apache - 2.0 license.

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご