🚀 M2M100 1.2B
M2M100 is a multilingual encoder-decoder (sequence-to-sequence) model. It's designed for Many-to-Many multilingual translation, offering a powerful solution for cross-language communication.
🚀 Quick Start
M2M100 was introduced in this paper and first released in this repository. The model can directly translate between the 9,900 directions of 100 languages. To translate into a target language, force the target language id as the first generated token by passing the `forced_bos_token_id` parameter to the `generate` method.
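For reference, here is a minimal sketch of that mechanism using the plain transformers API. It assumes the original facebook/m2m100_1.2B checkpoint rather than the CTranslate2 conversion distributed here:

```python
# Minimal sketch, assuming the original facebook/m2m100_1.2B checkpoint.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_1.2B")

tokenizer.src_lang = "en"
encoded = tokenizer("Hello world!", return_tensors="pt")

# Force German as the first generated token.
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("de"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```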
⚠️ Important Note
`M2M100Tokenizer` depends on `sentencepiece`, so make sure to install it before running the examples. To install `sentencepiece`, run `pip install sentencepiece`.
✨ Features
- Multilingual Support: Covers 100 languages, from Afrikaans (af) to Zulu (zu); see the full list under 📚 Documentation.
- Direct Translation: Translates directly between any of the 9,900 language pairs, with no English pivot.
📦 Installation
Install Python
First, install Python from the official website.
Install Required Libraries
Open a terminal (cmd on Windows) and run the following commands. The examples below also require ctranslate2, transformers, and sentencepiece:

```bash
python --version
python -m pip install huggingface_hub ctranslate2 transformers sentencepiece
```
Download the Model
```python
import huggingface_hub

# Download the CTranslate2 model files into ./m2m100-1.2B-ctranslate2
huggingface_hub.snapshot_download(repo_id='entai2965/m2m100-1.2B-ctranslate2', local_dir='m2m100-1.2B-ctranslate2')
```
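Alternatively, recent versions of huggingface_hub ship a CLI that does the same thing; a sketch (check `huggingface-cli download --help` for your installed version):

```bash
huggingface-cli download entai2965/m2m100-1.2B-ctranslate2 --local-dir m2m100-1.2B-ctranslate2
```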
💻 Usage Examples
Basic Usage
```python
import ctranslate2
import transformers

# Load the CTranslate2 model and the matching Hugging Face tokenizer.
translator = ctranslate2.Translator("m2m100-1.2B-ctranslate2", device="cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained("m2m100-1.2B-ctranslate2", clean_up_tokenization_spaces=True)
tokenizer.src_lang = "en"

# Tokenize the source text and set the target language as the decoder prefix.
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world!"))
target_prefix = [tokenizer.lang_code_to_token["de"]]
results = translator.translate_batch([source], target_prefix=[target_prefix])

# Drop the leading target-language token before decoding.
target = results[0].hypotheses[0][1:]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
```
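The same code also runs on a GPU. A minimal sketch of the optional knobs, assuming a CUDA-enabled CTranslate2 build (the values here are illustrative):

```python
# Minimal sketch, assuming a CUDA-enabled CTranslate2 build.
translator = ctranslate2.Translator(
    "m2m100-1.2B-ctranslate2",
    device="cuda",                # "cpu" also works
    compute_type="int8_float16",  # quantize weights at load time to save memory
)
results = translator.translate_batch([source], target_prefix=[target_prefix], beam_size=4)
```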
Advanced Usage (Batch Syntax)
```python
import os

import ctranslate2
import transformers

home_path = os.path.expanduser('~')
model_path = home_path + '/Downloads/models/m2m100-1.2B-ctranslate2'

source_language_code = 'ja'
target_language_code = 'es'
device = 'cpu'

string1 = 'イキリカメラマン'  # "a cocky cameraman"
string2 = 'おかあさん'  # "mom"
string3 = '人生はチョコレートの箱のようなものです。彼らは皆毒殺されています。'  # "Life is like a box of chocolates. They have all been poisoned."
list_to_translate = [string1, string2, string3]

# Load the model and tokenizer, and set the source language.
translator = ctranslate2.Translator(model_path, device=device)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path, clean_up_tokenization_spaces=True)
tokenizer.src_lang = source_language_code
target_language_token = [tokenizer.lang_code_to_token[target_language_code]]

# Tokenize every input string.
encoded_list = []
for text in list_to_translate:
    encoded_list.append(tokenizer.convert_ids_to_tokens(tokenizer.encode(text)))

# Translate the whole batch, forcing the target language for each entry.
translated_list = translator.translate_batch(encoded_list, target_prefix=[target_language_token] * len(encoded_list))

# Decode each hypothesis, dropping the leading target-language token.
for counter, tokens in enumerate(translated_list):
    translated_list[counter] = tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens.hypotheses[0][1:]))

for text in translated_list:
    print(text)
```
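Because translate_batch takes one target prefix per input, a single batch can also mix target languages. A small sketch reusing translator, tokenizer, and encoded_list from the example above (the language codes are illustrative):

```python
# Hedged sketch: one target language per input in the same batch.
per_item_codes = ['es', 'fr', 'de']  # illustrative, one code per input
prefixes = [[tokenizer.lang_code_to_token[code]] for code in per_item_codes]
results = translator.translate_batch(encoded_list, target_prefix=prefixes)
```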
Functional Programming Version
```python
import os

import ctranslate2
import transformers

home_path = os.path.expanduser('~')
model_path = home_path + '/Downloads/models/m2m100-1.2B-ctranslate2'

source_language_code = 'ja'
target_language_code = 'es'
device = 'cpu'

string1 = 'イキリカメラマン'
string2 = 'おかあさん'
string3 = '人生はチョコレートの箱のようなものです。彼らは皆毒殺されています。'
list_to_translate = [string1, string2, string3]

translator = ctranslate2.Translator(model_path, device=device)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path, clean_up_tokenization_spaces=True)
tokenizer.src_lang = source_language_code

translated_list = [
    tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens.hypotheses[0][1:]))
    for tokens in translator.translate_batch(
        [tokenizer.convert_ids_to_tokens(tokenizer.encode(i)) for i in list_to_translate],
        target_prefix=[[tokenizer.lang_code_to_token[target_language_code]]] * len(list_to_translate),
    )
]

for text in translated_list:
    print(text)
```
📚 Documentation
Languages covered
The model supports the following 100 languages:
Afrikaans (af), Amharic (am), Arabic (ar), Asturian (ast), Azerbaijani (az), Bashkir (ba), Belarusian (be), Bulgarian (bg), Bengali (bn), Breton (br), Bosnian (bs), Catalan; Valencian (ca), Cebuano (ceb), Czech (cs), Welsh (cy), Danish (da), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Persian (fa), Fulah (ff), Finnish (fi), French (fr), Western Frisian (fy), Irish (ga), Gaelic; Scottish Gaelic (gd), Galician (gl), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Croatian (hr), Haitian; Haitian Creole (ht), Hungarian (hu), Armenian (hy), Indonesian (id), Igbo (ig), Iloko (ilo), Icelandic (is), Italian (it), Japanese (ja), Javanese (jv), Georgian (ka), Kazakh (kk), Central Khmer (km), Kannada (kn), Korean (ko), Luxembourgish; Letzeburgesch (lb), Ganda (lg), Lingala (ln), Lao (lo), Lithuanian (lt), Latvian (lv), Malagasy (mg), Macedonian (mk), Malayalam (ml), Mongolian (mn), Marathi (mr), Malay (ms), Burmese (my), Nepali (ne), Dutch; Flemish (nl), Norwegian (no), Northern Sotho (ns), Occitan (post 1500) (oc), Oriya (or), Panjabi; Punjabi (pa), Polish (pl), Pushto; Pashto (ps), Portuguese (pt), Romanian; Moldavian; Moldovan (ro), Russian (ru), Sindhi (sd), Sinhala; Sinhalese (si), Slovak (sk), Slovenian (sl), Somali (so), Albanian (sq), Serbian (sr), Swati (ss), Sundanese (su), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th), Tagalog (tl), Tswana (tn), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Wolof (wo), Xhosa (xh), Yiddish (yi), Yoruba (yo), Chinese (zh), Zulu (zu)
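The same codes can also be read off the tokenizer at runtime; a small sketch, assuming the tokenizer object from the usage examples above:

```python
# List the language codes the tokenizer supports (expected: 100 entries).
print(len(tokenizer.lang_code_to_token))
print(sorted(tokenizer.lang_code_to_token))  # 'af', 'am', 'ar', ...
```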
🔧 Technical Details
The model is based on the facebook/m2m100_1.2B base model, converted to the CTranslate2 format. It uses a multilingual encoder-decoder (sequence-to-sequence) Transformer architecture.
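For reference, CTranslate2 conversions of this kind are typically produced with the converter bundled with the ctranslate2 package; a sketch, not necessarily the exact command used for this repository:

```bash
# Hedged sketch: convert the original checkpoint to the CTranslate2 format.
ct2-transformers-converter --model facebook/m2m100_1.2B --output_dir m2m100-1.2B-ctranslate2
```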
📄 License
This project is licensed under the MIT license.
BibTeX entry and citation info
```bibtex
@misc{fan2020englishcentric,
  title={Beyond English-Centric Multilingual Machine Translation},
  author={Angela Fan and Shruti Bhosale and Holger Schwenk and Zhiyi Ma and Ahmed El-Kishky and Siddharth Goyal and Mandeep Baines and Onur Celebi and Guillaume Wenzek and Vishrav Chaudhary and Naman Goyal and Tom Birch and Vitaliy Liptchinsky and Sergey Edunov and Edouard Grave and Michael Auli and Armand Joulin},
  year={2020},
  eprint={2010.11125},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```