# NLLB-200
This is a machine translation model that supports single-sentence translation among 200 languages. It is intended mainly for research purposes, especially in the field of low-resource languages.
## 🚀 Quick Start
This model can be downloaded and run using Python. For detailed steps, please refer to the following sections.
## ✨ Features
- Supports single-sentence translation among 200 languages.
- Evaluated using multiple metrics such as BLEU, spBLEU, and chrF++.
- Benchmarked on the Flores-200 dataset.
## 📦 Installation
### Download the model
- Install Python: https://www.python.org/downloads/
- Open the command prompt (`cmd`)
- Check the Python version: `python --version`
- Install the `huggingface_hub` library: `python -m pip install huggingface_hub`
- Run Python and execute the following code:

```python
import huggingface_hub

huggingface_hub.snapshot_download('entai2965/nllb-200-3.3B-ctranslate2', local_dir='nllb-200-3.3B-ctranslate2')
```
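Alternatively, recent versions of `huggingface_hub` install a `huggingface-cli` entry point, so the same snapshot can be fetched from the command prompt (an optional shortcut, not required; it uses the same repository id as above):

```bash
huggingface-cli download entai2965/nllb-200-3.3B-ctranslate2 --local-dir nllb-200-3.3B-ctranslate2
```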
### Install dependencies for running the model
- Open the command prompt (`cmd`)
- Install the `ctranslate2` and `transformers` libraries: `python -m pip install ctranslate2 transformers`
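To confirm both libraries are installed correctly, you can print their versions (a quick sanity check, nothing model-specific):

```bash
python -c "import ctranslate2, transformers; print(ctranslate2.__version__, transformers.__version__)"
```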
## 💻 Usage Examples
### Basic Usage
```python
import ctranslate2
import transformers

src_lang = "eng_Latn"
tgt_lang = "fra_Latn"

translator = ctranslate2.Translator("nllb-200-3.3B-ctranslate2", device='cpu')
tokenizer = transformers.AutoTokenizer.from_pretrained("nllb-200-3.3B-ctranslate2", src_lang=src_lang, clean_up_tokenization_spaces=True)

# Tokenize the input and translate with the target language code as the prefix.
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world!"))
target_prefix = [tgt_lang]
results = translator.translate_batch([source], target_prefix=[target_prefix])

# The first output token is the target language code, so skip it before decoding.
target = results[0].hypotheses[0][1:]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
```
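`translate_batch` also accepts standard CTranslate2 decoding options such as `beam_size` and `num_hypotheses`. A minimal sketch that reuses the `translator` and `tokenizer` from the example above to print several candidate translations (the option values shown are illustrative):

```python
# Run 4-way beam search and keep all four hypotheses (CTranslate2 decoding options).
results = translator.translate_batch(
    [source],
    target_prefix=[[tgt_lang]],
    beam_size=4,
    num_hypotheses=4,
)

# Each hypothesis starts with the target language token, so skip it before decoding.
for hypothesis in results[0].hypotheses:
    print(tokenizer.decode(tokenizer.convert_tokens_to_ids(hypothesis[1:])))
```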
### Advanced Usage
#### Batch translation
```python
import os
import ctranslate2
import transformers

# set defaults
home_path = os.path.expanduser('~')
model_folder = home_path + '/Downloads/models/nllb-200-3.3B-ctranslate2'  # 13 GB of memory
string1 = 'Hello world!'
string2 = 'Awesome.'
raw_list = [string1, string2]

# https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200
source_language_code = "eng_Latn"
target_language_code = "fra_Latn"
device = 'cpu'
# device = 'cuda'

# load models
translator = ctranslate2.Translator(model_folder, device=device)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_folder, src_lang=source_language_code, clean_up_tokenization_spaces=True)

# tokenize input
encoded_list = []
for text in raw_list:
    encoded_list.append(tokenizer.convert_ids_to_tokens(tokenizer.encode(text)))

# translate
# https://opennmt.net/CTranslate2/python/ctranslate2.Translator.html?#ctranslate2.Translator.translate_batch
translated_list = translator.translate_batch(encoded_list, target_prefix=[[target_language_code]] * len(raw_list))
assert (len(raw_list) == len(translated_list))

# decode
for counter, tokens in enumerate(translated_list):
    translated_list[counter] = tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens.hypotheses[0][1:]))

# output
for text in translated_list:
    print(text)
```
#### Functional programming version
```python
import os
import ctranslate2
import transformers

# set defaults
home_path = os.path.expanduser('~')
model_folder = home_path + '/Downloads/models/nllb-200-3.3B-ctranslate2'  # 13 GB of memory
string1 = 'Hello world!'
string2 = 'Awesome.'
raw_list = [string1, string2]

# https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200
source_language_code = "eng_Latn"
target_language_code = "fra_Latn"
device = 'cpu'
# device = 'cuda'

# load models
translator = ctranslate2.Translator(model_folder, device=device)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_folder, src_lang=source_language_code, clean_up_tokenization_spaces=True)

# invoke black magic
translated_list = [
    tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens.hypotheses[0][1:]))
    for tokens in translator.translate_batch(
        [tokenizer.convert_ids_to_tokens(tokenizer.encode(text)) for text in raw_list],
        target_prefix=[[target_language_code]] * len(raw_list),
    )
]
assert (len(raw_list) == len(translated_list))

# output
for text in translated_list:
    print(text)
```
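The same steps can also be wrapped into a reusable helper. This is a convenience sketch (the function name and signature are illustrative, not part of the model or either library); it reuses the `translator`, `tokenizer`, `raw_list`, and `target_language_code` defined above:

```python
def translate(texts, translator, tokenizer, target_language_code):
    """Translate a list of strings with an NLLB CTranslate2 translator.

    The tokenizer must have been loaded with the desired src_lang.
    """
    encoded = [tokenizer.convert_ids_to_tokens(tokenizer.encode(text)) for text in texts]
    results = translator.translate_batch(encoded, target_prefix=[[target_language_code]] * len(texts))
    # The first output token is the target language code, so drop it before decoding.
    return [tokenizer.decode(tokenizer.convert_tokens_to_ids(r.hypotheses[0][1:])) for r in results]

print(translate(raw_list, translator, tokenizer, target_language_code))
```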
## 📚 Documentation
### Intended Use
- Primary intended uses: NLLB-200 is a machine translation model primarily intended for research in machine translation, especially for low-resource languages. It allows for single-sentence translation among 200 languages. Information on how to use the model can be found in the Fairseq code repository along with the training code and references to evaluation and training data.
- Primary intended users: The primary users are researchers and the machine translation research community.
- Out-of-scope use cases: NLLB-200 is a research model and is not released for production deployment. It is trained on general-domain text data and is not intended to be used with domain-specific texts, such as medical or legal texts. The model is not intended for document translation. Since the model was trained with input lengths not exceeding 512 tokens, translating longer sequences might result in quality degradation (see the sentence-splitting sketch below). NLLB-200 translations cannot be used as certified translations.
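Because the model targets single sentences and was trained with inputs of at most 512 tokens, longer text is best split into sentences before translation. A minimal sketch, assuming a naive regex splitter (a proper per-language sentence segmenter would be preferable) and the `translator`/`tokenizer` loaded as in the usage examples above:

```python
import re

def split_sentences(text):
    # Naive split on sentence-final punctuation; replace with a real segmenter for serious use.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

long_text = "Hello world! This is a longer paragraph. Each sentence is translated separately."
sentences = split_sentences(long_text)
encoded = [tokenizer.convert_ids_to_tokens(tokenizer.encode(s)) for s in sentences]
results = translator.translate_batch(encoded, target_prefix=[["fra_Latn"]] * len(sentences))
print(" ".join(tokenizer.decode(tokenizer.convert_tokens_to_ids(r.hypotheses[0][1:])) for r in results))
```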
### Metrics
The NLLB-200 model was evaluated using the BLEU, spBLEU, and chrF++ metrics, which are widely adopted by the machine translation community. Additionally, human evaluation was performed with the XSTS protocol, and the toxicity of the generated translations was measured.
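For reference, corpus-level BLEU and chrF++ scores on your own test sets can be computed with the `sacrebleu` package (an illustrative use of sacrebleu, not the exact evaluation pipeline from the paper; spBLEU additionally relies on the Flores SentencePiece tokenization):

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["Bonjour le monde !"]    # system outputs
references = [["Bonjour le monde !"]]  # one list per reference set, aligned with the hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 gives chrF++
print(bleu.score, chrf.score)
```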
### Evaluation Data
- Datasets: The Flores-200 dataset, described in Section 4 of the paper.
- Motivation: Flores-200 was used because it provides full evaluation coverage of the languages in NLLB-200.
- Preprocessing: Sentence-split raw text data was preprocessed using SentencePiece. The SentencePiece model is released along with NLLB-200 (see the tokenizer check below).
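Since the tokenizer distributed with this conversion is SentencePiece-based, the subword segmentation can be inspected directly (a small check using the `tokenizer` loaded in the usage examples; the exact pieces depend on the vocabulary):

```python
# Inspect the SentencePiece subword tokens produced for a sample sentence.
print(tokenizer.tokenize("Hello world!"))
# e.g. ['▁Hello', '▁world', '!']
```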
### Training Data
Parallel multilingual data from various sources were used to train the model. A detailed report on the data selection and construction process is provided in Section 5 of the paper. Monolingual data constructed from Common Crawl were also used, with more details in Section 5.2.
### Ethical Considerations
In this work, a reflexive approach was taken in technological development to prioritize human users and minimize risks transferred to them. Many languages chosen for this study are low-resource languages, especially African languages. While quality translation can improve education and information access in these communities, it may also make groups with lower digital literacy more vulnerable to misinformation or online scams if bad actors misappropriate the work. Regarding data acquisition, the training data were mined from various publicly available web sources. Although much effort was put into data cleaning, personally identifiable information may not be entirely eliminated. Finally, despite efforts to optimize translation quality, mistranslations may still occur, which could have adverse impacts on those relying on these translations for important decisions (especially related to health and safety).
### Caveats and Recommendations
The model has been tested on the Wikimedia domain with limited investigation on other domains supported in NLLB-MD. In addition, the supported languages may have variations that the model does not capture. Users should make appropriate assessments.
### Carbon Footprint Details
The carbon dioxide equivalent (CO2e) estimate is reported in Section 8.8 of the paper.
## 🔧 Technical Details
- Training algorithms, parameters, fairness constraints, and other applied approaches: The exact training algorithm, the data, and the strategies used to handle data imbalances for high- and low-resource languages when training NLLB-200 are described in the paper.
- Paper or other resource for more information: NLLB Team et al., No Language Left Behind: Scaling Human-Centered Machine Translation, arXiv, 2022.
## 📄 License
This model is released under the CC-BY-NC license.
## Available languages
You can find the list of available languages here: https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200 (a programmatic check of the accepted codes is shown after the list below).
ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab,
aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab,
asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl,
bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn,
bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn,
cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn,
dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn,
ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn,
fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr,
hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn,
hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn,
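The language codes are registered as additional special tokens on the NLLB tokenizer, so the accepted codes can also be checked programmatically (a small sketch using standard transformers tokenizer attributes and the `tokenizer` from the usage examples):

```python
# The NLLB tokenizer stores the language codes as additional special tokens.
print(len(tokenizer.additional_special_tokens))
print(tokenizer.additional_special_tokens[:10])
assert "fra_Latn" in tokenizer.additional_special_tokens
```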
## ⚠️ Important Note
NLLB-200 is a research model and is not released for production deployment. It is trained on general-domain text data and is not suitable for domain-specific texts. Translating sequences longer than 512 tokens may lead to quality degradation, and the translations cannot be used as certified translations.
## 💡 Usage Tip
When using the model, make sure to select appropriate source and target language codes from the available language list. Also, pay attention to the memory requirements: the full-precision 3.3B model needs roughly 13 GB of memory, and quantized loading (sketched below) can reduce this.
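If memory is tight, CTranslate2 can quantize the weights when the model is loaded via the `compute_type` argument; this is a standard CTranslate2 option, though the exact memory savings and quality impact for this particular model are not reported here:

```python
import ctranslate2

# Load with 8-bit quantized weights to reduce memory use; other values such as
# "int8_float16" are available on supported GPUs.
translator = ctranslate2.Translator("nllb-200-3.3B-ctranslate2", device="cpu", compute_type="int8")
```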

