🚀 M2M100 1.2B
M2M100 is a multilingual encoder-decoder (sequence-to-sequence) model. It's designed for Many-to-Many multilingual translation, offering a powerful solution for cross-language communication.
🚀 Quick Start
M2M100 was introduced in this paper and first released in this repository. The model can directly translate between the 9,900 directions of 100 languages. To translate into a target language, force the target language id as the first generated token by passing the `forced_bos_token_id` parameter to the `generate` method.
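For reference, here is a minimal sketch of that mechanism using the plain transformers API. It assumes the original facebook/m2m100_1.2B checkpoint rather than the CTranslate2 conversion distributed here:

```python
# Minimal sketch, assuming the original facebook/m2m100_1.2B checkpoint.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_1.2B")

tokenizer.src_lang = "en"
encoded = tokenizer("Hello world!", return_tensors="pt")

# Force German as the first generated token.
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("de"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```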
⚠️ Important Note
`M2M100Tokenizer` depends on `sentencepiece`, so make sure to install it before running the examples. To install `sentencepiece`, run `pip install sentencepiece`.
✨ Features
- Multilingual Support: Covers 100 languages, from Afrikaans (af) to Zulu (zu); see the full list under 📚 Documentation.
- Direct Translation: Translates directly between any of the 9,900 language pairs, with no English pivot.
📦 Installation
Install Python
First, install Python from the official website.
Install Required Libraries
Open a terminal (cmd on Windows) and run the following commands. The examples below also require ctranslate2, transformers, and sentencepiece:

```bash
python --version
python -m pip install huggingface_hub ctranslate2 transformers sentencepiece
```
Download the Model
```python
import huggingface_hub

# Download the CTranslate2 model files into ./m2m100-1.2B-ctranslate2
huggingface_hub.snapshot_download(repo_id='entai2965/m2m100-1.2B-ctranslate2', local_dir='m2m100-1.2B-ctranslate2')
```
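Alternatively, recent versions of huggingface_hub ship a CLI that does the same thing; a sketch (check `huggingface-cli download --help` for your installed version):

```bash
huggingface-cli download entai2965/m2m100-1.2B-ctranslate2 --local-dir m2m100-1.2B-ctranslate2
```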
💻 Usage Examples
Basic Usage
```python
import ctranslate2
import transformers

# Load the CTranslate2 model and the matching Hugging Face tokenizer.
translator = ctranslate2.Translator("m2m100-1.2B-ctranslate2", device="cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained("m2m100-1.2B-ctranslate2", clean_up_tokenization_spaces=True)
tokenizer.src_lang = "en"

# Tokenize the source text and set the target language as the decoder prefix.
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world!"))
target_prefix = [tokenizer.lang_code_to_token["de"]]
results = translator.translate_batch([source], target_prefix=[target_prefix])

# Drop the leading target-language token before decoding.
target = results[0].hypotheses[0][1:]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
```
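The same code also runs on a GPU. A minimal sketch of the optional knobs, assuming a CUDA-enabled CTranslate2 build (the values here are illustrative):

```python
# Minimal sketch, assuming a CUDA-enabled CTranslate2 build.
translator = ctranslate2.Translator(
    "m2m100-1.2B-ctranslate2",
    device="cuda",                # "cpu" also works
    compute_type="int8_float16",  # quantize weights at load time to save memory
)
results = translator.translate_batch([source], target_prefix=[target_prefix], beam_size=4)
```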
Advanced Usage (Batch Syntax)
```python
import os

import ctranslate2
import transformers

home_path = os.path.expanduser('~')
model_path = home_path + '/Downloads/models/m2m100-1.2B-ctranslate2'

source_language_code = 'ja'
target_language_code = 'es'
device = 'cpu'

string1 = 'イキリカメラマン'  # "a cocky cameraman"
string2 = 'おかあさん'  # "mom"
string3 = '人生はチョコレートの箱のようなものです。彼らは皆毒殺されています。'  # "Life is like a box of chocolates. They have all been poisoned."
list_to_translate = [string1, string2, string3]

# Load the model and tokenizer, and set the source language.
translator = ctranslate2.Translator(model_path, device=device)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path, clean_up_tokenization_spaces=True)
tokenizer.src_lang = source_language_code
target_language_token = [tokenizer.lang_code_to_token[target_language_code]]

# Tokenize every input string.
encoded_list = []
for text in list_to_translate:
    encoded_list.append(tokenizer.convert_ids_to_tokens(tokenizer.encode(text)))

# Translate the whole batch, forcing the target language for each entry.
translated_list = translator.translate_batch(encoded_list, target_prefix=[target_language_token] * len(encoded_list))

# Decode each hypothesis, dropping the leading target-language token.
for counter, tokens in enumerate(translated_list):
    translated_list[counter] = tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens.hypotheses[0][1:]))

for text in translated_list:
    print(text)
```
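Because translate_batch takes one target prefix per input, a single batch can also mix target languages. A small sketch reusing translator, tokenizer, and encoded_list from the example above (the language codes are illustrative):

```python
# Hedged sketch: one target language per input in the same batch.
per_item_codes = ['es', 'fr', 'de']  # illustrative, one code per input
prefixes = [[tokenizer.lang_code_to_token[code]] for code in per_item_codes]
results = translator.translate_batch(encoded_list, target_prefix=prefixes)
```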
Functional Programming Version
```python
import os

import ctranslate2
import transformers

home_path = os.path.expanduser('~')
model_path = home_path + '/Downloads/models/m2m100-1.2B-ctranslate2'

source_language_code = 'ja'
target_language_code = 'es'
device = 'cpu'

string1 = 'イキリカメラマン'
string2 = 'おかあさん'
string3 = '人生はチョコレートの箱のようなものです。彼らは皆毒殺されています。'
list_to_translate = [string1, string2, string3]

translator = ctranslate2.Translator(model_path, device=device)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path, clean_up_tokenization_spaces=True)
tokenizer.src_lang = source_language_code

translated_list = [
    tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens.hypotheses[0][1:]))
    for tokens in translator.translate_batch(
        [tokenizer.convert_ids_to_tokens(tokenizer.encode(i)) for i in list_to_translate],
        target_prefix=[[tokenizer.lang_code_to_token[target_language_code]]] * len(list_to_translate),
    )
]

for text in translated_list:
    print(text)
```
📚 Documentation
Languages covered
The model supports the following 100 languages:
Afrikaans (af), Amharic (am), Arabic (ar), Asturian (ast), Azerbaijani (az), Bashkir (ba), Belarusian (be), Bulgarian (bg), Bengali (bn), Breton (br), Bosnian (bs), Catalan; Valencian (ca), Cebuano (ceb), Czech (cs), Welsh (cy), Danish (da), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Persian (fa), Fulah (ff), Finnish (fi), French (fr), Western Frisian (fy), Irish (ga), Gaelic; Scottish Gaelic (gd), Galician (gl), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Croatian (hr), Haitian; Haitian Creole (ht), Hungarian (hu), Armenian (hy), Indonesian (id), Igbo (ig), Iloko (ilo), Icelandic (is), Italian (it), Japanese (ja), Javanese (jv), Georgian (ka), Kazakh (kk), Central Khmer (km), Kannada (kn), Korean (ko), Luxembourgish; Letzeburgesch (lb), Ganda (lg), Lingala (ln), Lao (lo), Lithuanian (lt), Latvian (lv), Malagasy (mg), Macedonian (mk), Malayalam (ml), Mongolian (mn), Marathi (mr), Malay (ms), Burmese (my), Nepali (ne), Dutch; Flemish (nl), Norwegian (no), Northern Sotho (ns), Occitan (post 1500) (oc), Oriya (or), Panjabi; Punjabi (pa), Polish (pl), Pushto; Pashto (ps), Portuguese (pt), Romanian; Moldavian; Moldovan (ro), Russian (ru), Sindhi (sd), Sinhala; Sinhalese (si), Slovak (sk), Slovenian (sl), Somali (so), Albanian (sq), Serbian (sr), Swati (ss), Sundanese (su), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th), Tagalog (tl), Tswana (tn), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Wolof (wo), Xhosa (xh), Yiddish (yi), Yoruba (yo), Chinese (zh), Zulu (zu)
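The same codes can also be read off the tokenizer at runtime; a small sketch, assuming the tokenizer object from the usage examples above:

```python
# List the language codes the tokenizer supports (expected: 100 entries).
print(len(tokenizer.lang_code_to_token))
print(sorted(tokenizer.lang_code_to_token))  # 'af', 'am', 'ar', ...
```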
🔧 Technical Details
The model is based on the facebook/m2m100_1.2B base model, converted to the CTranslate2 format. It uses a multilingual encoder-decoder (sequence-to-sequence) Transformer architecture.
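For reference, CTranslate2 conversions of this kind are typically produced with the converter bundled with the ctranslate2 package; a sketch, not necessarily the exact command used for this repository:

```bash
# Hedged sketch: convert the original checkpoint to the CTranslate2 format.
ct2-transformers-converter --model facebook/m2m100_1.2B --output_dir m2m100-1.2B-ctranslate2
```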
📄 License
This project is licensed under the MIT license.
BibTeX entry and citation info
```bibtex
@misc{fan2020englishcentric,
  title={Beyond English-Centric Multilingual Machine Translation},
  author={Angela Fan and Shruti Bhosale and Holger Schwenk and Zhiyi Ma and Ahmed El-Kishky and Siddharth Goyal and Mandeep Baines and Onur Celebi and Guillaume Wenzek and Vishrav Chaudhary and Naman Goyal and Tom Birch and Vitaliy Liptchinsky and Sergey Edunov and Edouard Grave and Michael Auli and Armand Joulin},
  year={2020},
  eprint={2010.11125},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```