NLLB-200
This is a machine translation model that supports translation among 200 languages and is especially useful for research on low-resource languages. It performs single-sentence translation across a wide range of languages, supporting the development of machine translation technology.
Quick Start
Download the Model
- Install Python from python.org.
- Open a command prompt (cmd) and check the Python version:
python --version
- Install the huggingface_hub library:
python -m pip install huggingface_hub
- Open the Python interpreter:
python
- Download the model snapshot:
import huggingface_hub
huggingface_hub.snapshot_download('entai2965/nllb-200-distilled-600M-ctranslate2', local_dir='nllb-200-distilled-600M-ctranslate2')
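Alternatively, recent versions of huggingface_hub install a command-line downloader that can fetch the same snapshot without opening the Python interpreter. This is a sketch; the exact flags can vary between huggingface_hub versions:
huggingface-cli download entai2965/nllb-200-distilled-600M-ctranslate2 --local-dir nllb-200-distilled-600M-ctranslate2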
Run the Model
- Refer to the CTranslate2 guide for background.
- Open a command prompt (cmd) and install the dependencies:
python -m pip install ctranslate2 transformers
- Open the Python interpreter:
python
import ctranslate2
import transformers
src_lang = "eng_Latn"
tgt_lang = "fra_Latn"
translator = ctranslate2.Translator("nllb-200-distilled-600M-ctranslate2", device="cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained("nllb-200-distilled-600M-ctranslate2", src_lang=src_lang, clean_up_tokenization_spaces=True)
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world!"))
target_prefix = [tgt_lang]
results = translator.translate_batch([source], target_prefix=[target_prefix])
target = results[0].hypotheses[0][1:]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
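For convenience, the steps above can be wrapped into a small helper. This is a minimal sketch based directly on the snippet above; the function name translate_sentence is illustrative, and it assumes the model folder downloaded earlier sits in the working directory. For repeated calls, load the translator and tokenizer once outside the function instead of per call:
import ctranslate2
import transformers

def translate_sentence(text, src_lang="eng_Latn", tgt_lang="fra_Latn",
                       model_dir="nllb-200-distilled-600M-ctranslate2", device="cpu"):
    # Load the CTranslate2 model and the matching tokenizer.
    translator = ctranslate2.Translator(model_dir, device=device)
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_dir, src_lang=src_lang, clean_up_tokenization_spaces=True)
    # Tokenize, translate with the target language code as a prefix,
    # then drop that prefix token before decoding.
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    results = translator.translate_batch([source], target_prefix=[[tgt_lang]])
    target = results[0].hypotheses[0][1:]
    return tokenizer.decode(tokenizer.convert_tokens_to_ids(target))

print(translate_sentence("Hello world!"))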
Run the Model in Batch Syntax
import os
import ctranslate2
import transformers
#set defaults
home_path = os.path.expanduser('~')
model_folder = home_path + '/Downloads/models/nllb-200-distilled-600M-ctranslate2' # 3 GB of memory
#model_folder = home_path + '/Downloads/models/nllb-200-distilled-1.3B-ctranslate2' # 5.5 GB of memory
#model_folder = home_path + '/Downloads/models/nllb-200-3.3B-ctranslate2-float16' # 13 GB of memory in almost all cases, 7.6 GB on CUDA + GeForce RTX 2000 series and newer
#model_folder = home_path + '/Downloads/models/nllb-200-3.3B-ctranslate2' # 13 GB of memory
string1 = 'Hello world!'
string2 = 'Awesome.'
raw_list = [string1, string2]
#https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200
source_language_code = "eng_Latn"
target_language_code = "fra_Latn"
device = 'cpu'
#device = 'cuda'
#load models
translator = ctranslate2.Translator(model_folder, device=device)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_folder, src_lang=source_language_code, clean_up_tokenization_spaces=True)
#tokenize input
encoded_list = []
for text in raw_list:
    encoded_list.append(tokenizer.convert_ids_to_tokens(tokenizer.encode(text)))
#translate
#https://opennmt.net/CTranslate2/python/ctranslate2.Translator.html?#ctranslate2.Translator.translate_batch
translated_list = translator.translate_batch(encoded_list, target_prefix=[[target_language_code]] * len(raw_list))
assert(len(raw_list) == len(translated_list))
#decode
for counter, tokens in enumerate(translated_list):
    translated_list[counter] = tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens.hypotheses[0][1:]))
#output
for text in translated_list:
    print(text)
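translate_batch also accepts decoding options documented in the CTranslate2 API linked above. The call below is a sketch of an alternative invocation of the translation step from the script above; the values 4 and 32 are arbitrary examples:
translated_list = translator.translate_batch(
    encoded_list,
    target_prefix=[[target_language_code]] * len(raw_list),
    beam_size=4,        # wider beam: slower, but can improve translation quality
    max_batch_size=32,  # process long input lists in chunks to bound memory use
)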
Functional Programming Version
import os
import ctranslate2
import transformers
#set defaults
home_path = os.path.expanduser('~')
model_folder = home_path + '/Downloads/models/nllb-200-distilled-600M-ctranslate2' # 3 GB of memory
#model_folder = home_path + '/Downloads/models/nllb-200-distilled-1.3B-ctranslate2' # 5.5 GB of memory
#model_folder = home_path + '/Downloads/models/nllb-200-3.3B-ctranslate2-float16' # 13 GB of memory in almost all cases, 7.6 GB on CUDA + GeForce RTX 2000 series and newer
#model_folder = home_path + '/Downloads/models/nllb-200-3.3B-ctranslate2' # 13 GB of memory
string1 = 'Hello world!'
string2 = 'Awesome.'
raw_list = [string1, string2]
#https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200
source_language_code = "eng_Latn"
target_language_code = "fra_Latn"
device = 'cpu'
#device = 'cuda'
#load models
translator = ctranslate2.Translator(model_folder, device=device)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_folder, src_lang=source_language_code, clean_up_tokenization_spaces=True)
#tokenize, translate, and decode in a single expression
translated_list = [tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens.hypotheses[0][1:])) for tokens in translator.translate_batch([tokenizer.convert_ids_to_tokens(tokenizer.encode(text)) for text in raw_list], target_prefix=[[target_language_code]] * len(raw_list))]
assert(len(raw_list) == len(translated_list))
#output
for text in translated_list:
    print(text)
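If memory is tight, ctranslate2.Translator also accepts a compute_type argument that loads the weights in reduced precision. This is a sketch of that option; whether int8 or float16 is actually available depends on the device and the converted model:
translator = ctranslate2.Translator(model_folder, device=device, compute_type="int8")
# Other documented values include "float16" (GPU) and "auto"; the rest of the
# tokenize / translate_batch / decode flow is unchanged.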
Features
- Supports single-sentence translation among 200 languages, facilitating research in machine translation, especially for low-resource languages.
Documentation
Intended Use
- Primary intended uses: NLLB-200 is a machine translation model mainly for research in machine translation, especially for low-resource languages. It enables single-sentence translation among 200 languages. Usage information can be found in the Fairseq code repository, along with training code and references to evaluation and training data.
- Primary intended users: Researchers and the machine translation research community.
- Out-of-scope use cases: NLLB-200 is a research model and is not intended for production deployment. It was trained on general-domain text, so it is not suitable for domain-specific texts (e.g., medical or legal documents) or for document-level translation. The maximum input length is 512 tokens, and translating longer sequences may degrade quality (a sketch for splitting longer text into sentences follows this list). Its translations cannot be used as certified translations.
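Because the model targets single-sentence inputs with a 512-token maximum, longer documents are usually split into sentences before translation. The sketch below uses a deliberately naive regex splitter for illustration only; a proper language-aware sentence splitter is preferable in practice:
import re

def split_into_sentences(text):
    # Naive splitter: break on ., ! or ? followed by whitespace. Illustration only.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

document = "NLLB-200 translates one sentence at a time. Split longer documents first. Then translate each sentence."
sentences = split_into_sentences(document)
# 'sentences' can now be tokenized and passed to translator.translate_batch()
# exactly like raw_list in the batch example above.
print(sentences)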
Metrics
- Model performance measures: The NLLB-200 model was evaluated using the BLEU, spBLEU, and chrF++ metrics widely adopted in the machine translation community. In addition, human evaluation was performed with the XSTS protocol, and the toxicity of generated translations was measured (a brief sketch for computing the automatic metrics follows).
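As a point of reference, corpus-level BLEU and chrF++ can be computed with the sacrebleu library. The snippet below is a sketch with made-up strings, not an attempt to reproduce the published numbers:
import sacrebleu  # pip install sacrebleu

hypotheses = ["Bonjour le monde !", "Génial."]          # system outputs (illustrative)
references = [["Bonjour le monde !", "Formidable."]]    # one reference stream, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 gives chrF++
print(round(bleu.score, 1), round(chrf.score, 1))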
Evaluation Data
- Datasets: The Flores-200 dataset, described in Section 4, was used.
- Motivation: It provides full evaluation coverage of the languages in NLLB-200.
- Preprocessing: Sentence-split raw text data was preprocessed using SentencePiece, and the SentencePiece model was released with NLLB-200 (an illustrative snippet follows this list).
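To illustrate the preprocessing step, a SentencePiece model can be loaded and applied with the sentencepiece package. The file path below is an assumption for demonstration, not the exact artifact shipped with NLLB-200:
import sentencepiece as spm  # pip install sentencepiece

# Hypothetical path to the released SentencePiece model; adjust to the actual file name.
sp = spm.SentencePieceProcessor(model_file="path/to/nllb200_sentencepiece.model")
print(sp.encode("Hello world!", out_type=str))  # subword pieces used during preprocessing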
Training Data
Parallel multilingual data from various sources was used to train the model. Details on data selection and construction are provided in Section 5 of the paper. Monolingual data constructed from Common Crawl was also used, with more details in Section 5.2.
Ethical Considerations
In this work, a reflexive approach was taken in technological development to prioritize human users and minimize risks. Many languages in the study are low-resource languages, especially African languages. While quality translation can improve education and information access, it may also make less digitally literate groups more vulnerable to misinformation or online scams. Training data was mined from public web sources, and although data cleaning was extensive, personally identifiable information may not be completely eliminated. Mistranslations, though rare, could have adverse impacts on decision-making related to health and safety.
Caveats and Recommendations
The model has been tested on the Wikimedia domain with limited investigation on other NLLB-MD supported domains. Supported languages may have variations not captured by the model, so users should make appropriate assessments.
Carbon Footprint Details
The carbon dioxide equivalent (CO2e) estimate is reported in Section 8.8 of the paper.
Technical Details
- Information about training algorithms, parameters, fairness constraints, and other applied approaches is described in the paper: NLLB Team et al., No Language Left Behind: Scaling Human-Centered Machine Translation, arXiv, 2022.
License
The license for this model is CC-BY-NC.
Available languages
Refer to the FLORES-200 README for the full list of language codes: https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200. A partial list is reproduced below, and the snippet after it shows one way to print the codes accepted by the local checkpoint.
ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab,
aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab,
asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl,
bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn,
bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn,
cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn,
dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn,
ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn,
fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr,
hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn,
hye_Armn
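One way to check which codes the local checkpoint accepts is to inspect the tokenizer, since the NLLB tokenizer registers the FLORES-200 language codes as additional special tokens. This is a sketch that assumes the converted folder keeps the original tokenizer files:
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("nllb-200-distilled-600M-ctranslate2")
print(sorted(tokenizer.additional_special_tokens))  # expected to include codes like eng_Latn, fra_Latn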
From: https://huggingface.co/facebook/nllb-200-distilled-600M
Metrics for this particular checkpoint are reported on the original model page linked above.
For questions or comments about the model, please visit here.

