# 🚀 eng-ine

This translation model translates English into a wide range of Indo-European languages.

## 🚀 Quick Start
The eng-ine model is designed for translating English into a multitude of Indo-European languages. Before translating, make sure the required pre-processing is in place: every input sentence must begin with a target-language token of the form `>>id<<`, where `id` is a valid target-language ID.
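The token requirement above can be sketched as a small helper that builds the prefixed input; the helper name and the `deu` (German) ID below are illustrative, and any ID from the target-language list in the Documentation section may be used:

```python
def add_target_token(text: str, lang_id: str) -> str:
    """Prepend the sentence-initial target-language token >>id<< required by the model."""
    return f">>{lang_id}<< {text}"

# Example: mark a sentence for translation into German (ID "deu")
tagged = add_target_token("How are you?", "deu")
print(tagged)  # >>deu<< How are you?
```

The tagged string is then passed to the tokenizer exactly like any other input; the token tells the multilingual decoder which target language to produce.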
## ✨ Features
- Broad Language Support: Covers a vast array of Indo-European languages, including Afrikaans, Armenian, Belarusian, Bulgarian, Catalan, Czech, Danish, Dutch, French, Galician, German, Greek, Gujarati, Hindi, Icelandic, Irish, Italian, Latvian, Lithuanian, Macedonian, Marathi, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovenian, Spanish, Swedish, Ukrainian, Urdu, and Welsh (see the full target-language list below).
- Transformer Model: Uses a transformer architecture, known for its effectiveness in sequence-to-sequence tasks such as machine translation.
- Pre-processing: Applies normalization and SentencePiece (spm32k, spm32k), which improves the handling of text data.
## 📦 Installation
No installation steps are provided in the original document. To use this model, you may need to download the original weights from the provided link: [opus2m-2020-08-01.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-ine/opus2m-2020-08-01.zip)
## 💻 Usage Examples
No code examples are provided in the original document. In a typical machine-translation workflow, however, you might use Hugging Face's `transformers` library. The following sketch assumes the model and tokenizer have been downloaded locally (the paths are placeholders):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the model and tokenizer (placeholder paths)
model = AutoModelForSeq2SeqLM.from_pretrained('path/to/downloaded/model')
tokenizer = AutoTokenizer.from_pretrained('path/to/downloaded/tokenizer')

# Input English text
input_text = "This is a test sentence."

# Prepend the target-language token (e.g., Spanish)
input_text_with_token = ">>spa<< " + input_text

# Tokenize the input
input_ids = tokenizer.encode(input_text_with_token, return_tensors='pt')

# Generate the translation
outputs = model.generate(input_ids)

# Decode the output
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
```
## 📚 Documentation
### Model Information
Property | Details |
---|---|
Model Type | Transformer |
Source Language | English (eng) |
Target Languages | afr aln ang_Latn arg asm ast awa bel bel_Latn ben bho bos_Latn bre bul bul_Latn cat ces cor cos csb_Latn cym dan deu dsb egl ell enm_Latn ext fao fra frm_Latn frr fry gcf_Latn gla gle glg glv gom gos got_Goth grc_Grek gsw guj hat hif_Latn hin hrv hsb hye ind isl ita jdt_Cyrl ksh kur_Arab kur_Latn lad lad_Latn lat_Latn lav lij lit lld_Latn lmo ltg ltz mai mar max_Latn mfe min mkd mwl nds nld nno nob nob_Hebr non_Latn npi oci ori orv_Cyrl oss pan_Guru pap pdc pes pes_Latn pes_Thaa pms pnb pol por prg_Latn pus roh rom ron rue rus san_Deva scn sco sgs sin slv snd_Arab spa sqi srp_Cyrl srp_Latn stq swe swg tgk_Cyrl tly_Latn tmw_Latn ukr urd vec wln yid zlm_Latn zsm_Latn zza |
Pre-processing | Normalization + SentencePiece (spm32k, spm32k) |
Original Weights Download | [opus2m-2020-08-01.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-ine/opus2m-2020-08-01.zip) |
Test Set Translations | [opus2m-2020-08-01.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-ine/opus2m-2020-08-01.test.txt) |
Test Set Scores | [opus2m-2020-08-01.eval.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-ine/opus2m-2020-08-01.eval.txt) |
### Benchmarks
testset | BLEU | chr-F |
---|---|---|
newsdev2014-enghin.eng.hin | 6.2 | 0.317 |
newsdev2016-enro-engron.eng.ron | 22.1 | 0.525 |
newsdev2017-enlv-englav.eng.lav | 17.4 | 0.486 |
newsdev2019-engu-engguj.eng.guj | 6.5 | 0.303 |
newsdev2019-enlt-englit.eng.lit | 14.9 | 0.476 |
newsdiscussdev2015-enfr-engfra.eng.fra | 26.4 | 0.547 |
newsdiscusstest2015-enfr-engfra.eng.fra | 30.0 | 0.575 |
newssyscomb2009-engces.eng.ces | 14.7 | 0.442 |
newssyscomb2009-engdeu.eng.deu | 16.7 | 0.487 |
newssyscomb2009-engfra.eng.fra | 24.8 | 0.547 |
newssyscomb2009-engita.eng.ita | 25.2 | 0.562 |
newssyscomb2009-engspa.eng.spa | 27.0 | 0.554 |
news-test2008-engces.eng.ces | 13.0 | 0.417 |
news-test2008-engdeu.eng.deu | 17.4 | 0.480 |
news-test2008-engfra.eng.fra | 22.3 | 0.519 |
news-test2008-engspa.eng.spa | 24.9 | 0.532 |
newstest2009-engces.eng.ces | 13.6 | 0.432 |
newstest2009-engdeu.eng.deu | 16.6 | 0.482 |
newstest2009-engfra.eng.fra | 23.5 | 0.535 |
newstest2009-engita.eng.ita | 25.5 | 0.561 |
newstest2009-engspa.eng.spa | 26.3 | 0.551 |
newstest2010-engces.eng.ces | 14.2 | 0.436 |
newstest2010-engdeu.eng.deu | 18.3 | 0.492 |
newstest2010-engfra.eng.fra | 25.7 | 0.550 |
newstest2010-engspa.eng.spa | 30.5 | 0.578 |
newstest2011-engces.eng.ces | 15.1 | 0.439 |
newstest2011-engdeu.eng.deu | 17.1 | 0.478 |
newstest2011-engfra.eng.fra | 28.0 | 0.569 |
newstest2011-engspa.eng.spa | 31.9 | 0.580 |
newstest2012-engces.eng.ces | 13.6 | 0.418 |
newstest2012-engdeu.eng.deu | 17.0 | 0.475 |
newstest2012-engfra.eng.fra | 26.1 | 0.553 |
newstest2012-engrus.eng.rus | 21.4 | 0.506 |
newstest2012-engspa.eng.spa | 31.4 | 0.577 |
newstest2013-engces.eng.ces | 15.3 | 0.438 |
newstest2013-engdeu.eng.deu | 20.3 | 0.501 |
newstest2013-engfra.eng.fra | 26.0 | 0.540 |
newstest2013-engrus.eng.rus | 16.1 | 0.449 |
newstest2013-engspa.eng.spa | 28.6 | 0.555 |
newstest2014-hien-enghin.eng.hin | 9.5 | 0.344 |
newstest2015-encs-engces.eng.ces | 14.8 | 0.440 |
newstest2015-ende-engdeu.eng.deu | 22.6 | 0.523 |
newstest2015-enru-engrus.eng.rus | 18.8 | 0.483 |
newstest2016-encs-engces.eng.ces | 16.8 | 0.457 |
newstest2016-ende-engdeu.eng.deu | 26.2 | 0.555 |
newstest2016-enro-engron.eng.ron | 21.2 | 0.510 |
newstest2016-enru-engrus.eng.rus | 17.6 | 0.471 |
newstest2017-encs-engces.eng.ces | 13.6 | 0.421 |
newstest2017-ende-engdeu.eng.deu | 21.5 | 0.516 |
newstest2017-enlv-englav.eng.lav | 13.0 | 0.452 |
newstest2017-enru-engrus.eng.rus | 18.7 | 0.486 |
newstest2018-encs-engces.eng.ces | 13.5 | 0.425 |
newstest2018-ende-engdeu.eng.deu | 29.8 | 0.581 |
newstest2018-enru-engrus.eng.rus | 16.1 | 0.472 |
newstest2019-encs-engces.eng.ces | 14.8 | 0.435 |
newstest2019-ende-engdeu.eng.deu | 26.6 | 0.554 |
newstest2019-engu-engguj.eng.guj | 6.9 | 0.313 |
newstest2019-enlt-englit.eng.lit | 10.6 | 0.429 |
newstest2019-enru-engrus.eng.rus | 17.5 | 0.452 |
Tatoeba-test.eng-afr.eng.afr | 52.1 | 0.708 |
Tatoeba-test.eng-ang.eng.ang | 5.1 | 0.131 |
Tatoeba-test.eng-arg.eng.arg | 1.2 | 0.099 |
Tatoeba-test.eng-asm.eng.asm | 2.9 | 0.259 |
Tatoeba-test.eng-ast.eng.ast | 14.1 | 0.408 |
Tatoeba-test.eng-awa.eng.awa | 0.3 | 0.002 |
Tatoeba-test.eng-bel.eng.bel | 18.1 | 0.450 |
Tatoeba-test.eng-ben.eng.ben | 13.5 | 0.432 |
Tatoeba-test.eng-bho.eng.bho | 0.3 | 0.003 |
Tatoeba-test.eng-bre.eng.bre | 10.4 | 0.318 |
Tatoeba-test.eng-bul.eng.bul | 38.7 | 0.592 |
Tatoeba-test.eng-cat.eng.cat | 42.0 | 0.633 |
Tatoeba-test.eng-ces.eng.ces | 32.3 | 0.546 |
Tatoeba-test.eng-cor.eng.cor | 0.5 | 0.079 |
Tatoeba-test.eng-cos.eng.cos | 3.1 | 0.148 |
Tatoeba-test.eng-csb.eng.csb | 1.4 | 0.216 |
Tatoeba-test.eng-cym.eng.cym | 22.4 | 0.470 |
Tatoeba-test.eng-dan.eng.dan | 49.7 | 0.671 |
Tatoeba-test.eng-deu.eng.deu | 31.7 | 0.554 |
Tatoeba-test.eng-dsb.eng.dsb | 1.1 | 0.139 |
Tatoeba-test.eng-egl.eng.egl | 0.9 | 0.089 |
Tatoeba-test.eng-ell.eng.ell | 42.7 | 0.640 |
Tatoeba-test.eng-enm.eng.enm | 3.5 | 0.259 |
Tatoeba-test.eng-ext.eng.ext | 6.4 | 0.235 |
Tatoeba-test.eng-fao.eng.fao | 6.6 | 0.285 |
Tatoeba-test.eng-fas.eng.fas | 5.7 | 0.257 |
Tatoeba-test.eng-fra.eng.fra | 38.4 | 0.595 |
Tatoeba-test.eng-frm.eng.frm | 0.9 | 0.149 |
Tatoeba-test.eng-frr.eng.frr | 8.4 | 0.145 |
Tatoeba-test.eng-fry.eng.fry | 16.5 | 0.411 |
Tatoeba-test.eng-gcf.eng.gcf | 0.6 | 0.098 |
Tatoeba-test.eng-gla.eng.gla | 11.6 | 0.361 |
Tatoeba-test.eng-gle.eng.gle | 32.5 | 0.546 |
Tatoeba-test.eng-glg.eng.glg | 38.4 | 0.602 |
Tatoeba-test.eng-glv.eng.glv | 23.1 | 0.418 |
Tatoeba-test.eng-gos.eng.gos | 0.7 | 0.137 |
Tatoeba-test.eng-got.eng.got | 0.2 | 0.010 |
Tatoeba-test.eng-grc.eng.grc | 0.0 | 0.005 |
Tatoeba-test.eng-gsw.eng.gsw | 0.9 | 0.108 |
Tatoeba-test.eng-guj.eng.guj | 20.8 | 0.391 |
Tatoeba-test.eng-hat.eng.hat | 34.0 | 0.537 |
Tatoeba-test.eng-hbs.eng.hbs | 33.7 | 0.567 |
Tatoeba-test.eng-hif.eng.hif | 2.8 | 0.269 |
Tatoeba-test.eng-hin.eng.hin | 15.6 | 0.437 |
Tatoeba-test.eng-hsb.eng.hsb | 5.4 | 0.320 |
Tatoeba-test.eng-hye.eng.hye | 17.4 | 0.426 |
Tatoeba-test.eng-isl.eng.isl | 17.4 | 0.436 |
Tatoeba-test.eng-ita.eng.ita | 40.4 | 0.636 |
Tatoeba-test.eng-jdt.eng.jdt | 6.4 | 0.008 |
Tatoeba-test.eng-kok.eng.kok | 6.6 | 0.005 |
Tatoeba-test.eng-ksh.eng.ksh | 0.8 | 0.123 |
Tatoeba-test.eng-kur.eng.kur | 10.2 | 0.209 |
Tatoeba-test.eng-lad.eng.lad | 0.8 | 0.163 |
Tatoeba-test.eng-lah.eng.lah | 0.2 | 0.001 |
Tatoeba-test.eng-lat.eng.lat | 9.4 | 0.372 |
Tatoeba-test.eng-lav.eng.lav | 30.3 | 0.559 |
Tatoeba-test.eng-lij.eng.lij | 1.0 | 0.130 |
Tatoeba-test.eng-lit.eng.lit | 25.3 | 0.560 |
Tatoeba-test.eng-lld.eng.lld | 0.4 | 0.139 |
Tatoeba-test.eng-lmo.eng.lmo | 0.6 | 0.108 |
Tatoeba-test.eng-ltz.eng.ltz | 18.1 | 0.388 |
Tatoeba-test.eng-mai.eng.mai | 17.2 | 0.464 |
Tatoeba-test.eng-mar.eng.mar | 18.0 | 0.451 |
Tatoeba-test.eng-mfe.eng.mfe | 81.0 | 0.899 |
Tatoeba-test.eng-mkd.eng.mkd | 37.6 | 0.587 |
Tatoeba-test.eng-msa.eng.msa | 27.7 | 0.519 |
Tatoeba-test.eng-multi | 32.6 | 0.539 |
Tatoeba-test.eng-mwl.eng.mwl | 3.8 | 0.134 |
Tatoeba-test.eng-nds.eng.nds | 14.3 | 0.401 |
Tatoeba-test.eng-nep.eng.nep | 0.5 | 0.002 |
Tatoeba-test.eng-nld.eng.nld | 44.0 | 0.642 |
Tatoeba-test.eng-non.eng.non | 0.7 | 0.118 |
Tatoeba-test.eng-nor.eng.nor | 42.7 | 0.623 |
Tatoeba-test.eng-oci.eng.oci | 7.2 | 0.295 |
Tatoeba-test.eng-ori.eng.ori | 2.7 | 0.257 |
Tatoeba-test.eng-orv.eng.orv | 0.2 | 0.008 |
Tatoeba-test.eng-oss.eng.oss | 2.9 | 0.264 |
Tatoeba-test.eng-pan.eng.pan | 7.4 | 0.337 |
Tatoeba-test.eng-pap.eng.pap | 48.5 | 0.656 |
Tatoeba-test.eng-pdc.eng.pdc | 1.8 | 0.145 |
Tatoeba-test.eng-pms.eng.pms | 0.7 | 0.136 |
Tatoeba-test.eng-pol.eng.pol | 31.1 | 0.563 |
Tatoeba-test.eng-por.eng.por | 37.0 | 0.605 |
Tatoeba-test.eng-prg.eng.prg | 0.2 | 0.100 |
Tatoeba-test.eng-pus.eng.pus | 1.0 | 0.134 |
Tatoeba-test.eng-roh.eng.roh | 2.3 | 0.236 |
Tatoeba-test.eng-rom.eng.rom | 7.8 | 0.340 |
Tatoeba-test.eng-ron.eng.ron | 34.3 | 0.585 |
Tatoeba-test.eng-rue.eng.rue | 0.2 | 0.010 |
Tatoeba-test.eng-rus.eng.rus | 29.6 | 0.526 |
Tatoeba-test.eng-san.eng.san | 2.4 | 0.125 |
Tatoeba-test.eng-scn.eng.scn | 1.6 | 0.079 |
Tatoeba-test.eng-sco.eng.sco | 33.6 | 0.562 |
Tatoeba-test.eng-sgs.eng.sgs | 3.4 | 0.114 |
Tatoeba-test.eng-sin.eng.sin | 9.2 | 0.349 |
Tatoeba-test.eng-slv.eng.slv | 15.6 | 0.334 |
Tatoeba-test.eng-snd.eng.snd | 9.1 | 0.324 |
Tatoeba-test.eng-spa.eng.spa | 43.4 | 0.645 |
Tatoeba-test.eng-sqi.eng.sqi | 39.0 | 0.621 |
Tatoeba-test.eng-stq.eng.stq | 10.8 | 0.373 |
Tatoeba-test.eng-swe.eng.swe | 49.9 | 0.663 |
Tatoeba-test.eng-swg.eng.swg | 0.7 | 0.137 |
Tatoeba-test.eng-tgk.eng.tgk | 6.4 | 0.346 |
Tatoeba-test.eng-tly.eng.tly | 0.5 | 0.055 |
Tatoeba-test.eng-ukr.eng.ukr | 31.4 | 0.536 |
Tatoeba-test.eng-urd.eng.urd | 11.1 | 0.389 |
Tatoeba-test.eng-vec.eng.vec | 1.3 | 0.110 |
Tatoeba-test.eng-wln.eng.wln | 6.8 | 0.233 |
Tatoeba-test.eng-yid.eng.yid | 5.8 | 0.295 |
Tatoeba-test.eng-zza.eng.zza | 0.8 | 0.086 |
### System Info
Property | Details |
---|---|
hf_name | eng-ine |
Source Languages | eng |
Target Languages | ine |
OPUS Readme URL | [https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/eng-ine/README.md](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/eng-ine/README.md) |
Original Repo | Tatoeba-Challenge |
Tags | ['translation'] |
Languages | ['en', 'ca', 'es', 'os', 'ro', 'fy', 'cy', 'sc', 'is', 'yi', 'lb', 'an', 'sq', 'fr', 'ht', 'rm', 'ps', 'af', 'uk', 'sl', 'lt', 'bg', 'be', 'gd', 'si', 'br', 'mk', 'or', 'mr', 'ru', 'fo', 'co', 'oc', 'pl', 'gl', 'nb', 'bn', 'id', 'hy', 'da', 'gv', 'nl', 'pt', 'hi', 'as', 'kw', 'ga', 'sv', 'gu', 'wa', 'lv', 'el', 'it', 'hr', 'ur', 'nn', 'de', 'cs', 'ine'] |
Source Constituents | {'eng'} |
Target Constituents | {'cat', 'spa', 'pap', 'mwl', 'lij', 'bos_Latn', 'lad_Latn', 'lat_Latn', 'pcd', 'oss', 'ron', 'fry', 'cym', 'awa', 'swg', 'zsm_Latn', 'srd', 'gcf_Latn', 'isl', 'yid', 'bho', 'ltz', 'kur_Latn', 'arg', 'pes_Thaa', 'sqi', 'csb_Latn', 'fra', 'hat', 'non_Latn', 'sco', 'pnb', 'roh', 'bul_Latn', 'pus', 'afr', 'ukr', 'slv', 'lit', 'tmw_Latn', 'hsb', 'tly_Latn', 'bul', 'bel', 'got_Goth', 'lat_Grek', 'ext', 'gla', 'mai', 'sin', 'hif_Latn', 'eng', 'bre', 'nob_Hebr', 'prg_Latn', 'ang_Latn', 'aln', 'mkd', 'ori', 'mar', 'afr_Arab', 'san_Deva', 'gos', 'rus', 'fao', 'orv_Cyrl', 'bel_Latn', 'cos', 'zza', 'grc_Grek', 'oci', 'mfe', 'gom', 'bjn', 'sgs', 'tgk_Cyrl', 'hye_Latn', 'pdc', 'srp_Cyrl', 'pol', 'ast', 'glg', 'pms', 'nob', 'ben', 'min', 'srp_Latn', 'zlm_Latn', 'ind', 'rom', 'hye', 'scn', 'enm_Latn', 'lmo', 'npi', 'pes', 'dan', 'rus_Latn', 'jdt_Cyrl', 'gsw', 'glv', 'nld', 'snd_Arab', 'kur_Arab', 'por', 'hin', 'dsb', 'asm', 'lad', 'frm_Latn', 'ksh', 'pan_Guru', 'cor', 'gle', 'swe', 'guj', 'wln', 'lav', 'ell', 'frr', 'rue', 'ita', 'hrv', 'urd', 'stq', 'nno', 'deu', 'lld_Latn', 'ces', 'egl', 'vec', 'max_Latn', 'pes_Latn', 'ltg', 'nds'} |
Source Multilingual | False |
Target Multilingual | True |
Pre-processing | Normalization + SentencePiece (spm32k, spm32k) |
URL Model | [https://object.pouta.csc.fi/Tatoeba-MT-models/eng-ine/opus2m-2020-08-01.zip](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-ine/opus2m-2020-08-01.zip) |
URL Test Set | [https://object.pouta.csc.fi/Tatoeba-MT-models/eng-ine/opus2m-2020-08-01.test.txt](https://object.pouta.csc.fi/Tatoeba-MT-models/eng-ine/opus2m-2020-08-01.test.txt) |
Source Alpha3 | eng |
Target Alpha3 | ine |
Short Pair | en-ine |
chrF2 Score | 0.539 |
BLEU | 32.6 |
Brevity Penalty | 0.973 |
Ref Len | 68664.0 |
Source Name | English |
Target Name | Indo-European languages |
Train Date | 2020-08-01 |
Source Alpha2 | en |
Target Alpha2 | ine |
Prefer Old | False |
Helsinki Git SHA | 480fcbe0ee1bf4774bcbe6226ad9f58e63f6c535 |
Transformers Git SHA | 2207e5d8cb224e954a7cba69fa4ac23 |
## 📄 License
This project is licensed under the Apache-2.0 license.

