NLLB-200 3.3B CT2 int8
Model Overview
Model Features
Model Capabilities
Use Cases
Fast Inference with CTranslate2
This project speeds up inference and reduces memory usage by 2x to 4x using int8 inference in C++ on CPU or GPU. It is a quantized version of facebook/nllb-200-3.3B.
Quick Start
Installation
```shell
pip install ctranslate2
```
Compatibility
This checkpoint is compatible with `ctranslate2>=3.22.0`.
- Use `compute_type=int8_float16` for `device="cuda"`.
- Use `compute_type=int8` for `device="cpu"`.
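Putting those compatibility notes together, a minimal sketch of loading the converted model with the compute type matching the device (the model directory name is a placeholder, and the `ctranslate2` import is deferred so the helpers load even without the package installed):

```python
def compute_type_for(device: str) -> str:
    """Return the quantization mode recommended above for each device."""
    return "int8_float16" if device == "cuda" else "int8"


def load_translator(model_dir: str, device: str = "cpu"):
    """Load the converted model directory with the matching compute type."""
    import ctranslate2  # deferred import; requires `pip install ctranslate2`

    return ctranslate2.Translator(
        model_dir,
        device=device,
        compute_type=compute_type_for(device),
    )
```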
Conversion
The conversion was done on 2023-12-01 using `CTranslate2==3.22.0`:
```python
from ctranslate2.converters import TransformersConverter

TransformersConverter(
    "facebook/nllb-200-3.3B",
    activation_scales=None,
    copy_files=[
        "tokenizer.json",
        "generation_config.json",
        "README.md",
        "special_tokens_map.json",
        "tokenizer_config.json",
        ".gitattributes",
    ],
    load_as_float16=True,
    revision=None,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).convert(
    output_dir=str(tmp_dir),  # tmp_dir: destination directory for the converted model
    vmap=None,
    quantization="int8",
    force=True,
)
```
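Once converted, the model can be used for translation with the usual CTranslate2 + transformers pattern. A sketch under stated assumptions: the model path, source language, and target language below are illustrative, and the imports are deferred so the function loads without the packages installed:

```python
def translate(
    text: str,
    model_dir: str,
    src_lang: str = "eng_Latn",
    tgt_lang: str = "fra_Latn",
) -> str:
    """Translate one sentence with a converted NLLB-200 model (sketch)."""
    import ctranslate2  # deferred; requires `pip install ctranslate2`
    import transformers  # deferred; requires `pip install transformers`

    tokenizer = transformers.AutoTokenizer.from_pretrained(model_dir, src_lang=src_lang)
    translator = ctranslate2.Translator(model_dir, device="cpu", compute_type="int8")

    # CTranslate2 consumes token strings; NLLB expects the target language
    # code as a prefix token on the decoder side.
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    results = translator.translate_batch([source], target_prefix=[[tgt_lang]])
    target = results[0].hypotheses[0][1:]  # drop the target-language prefix token
    return tokenizer.decode(tokenizer.convert_tokens_to_ids(target))
```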
Features
Supported Languages
- ace, acm, acq, aeb, af, ajp, ak, als, am, apc, ar, ars, ary, arz, as, ast, awa, ayr, azb, azj, ba, bm, ban, be, bem, bn, bho, bjn, bo, bs, bug, bg, ca, ceb, cs, cjk, ckb, crh, cy, da, de, dik, dyu, dz, el, en, eo, et, eu, ee, fo, fj, fi, fon, fr, fur, fuv, gaz, gd, ga, gl, gn, gu, ht, ha, he, hi, hne, hr, hu, hy, ig, ilo, id, is, it, jv, ja, kab, kac, kam, kn, ks, ka, kk, kbp, kea, khk, km, ki, rw, ky, kmb, kmr, knc, kg, ko, lo, lij, li, ln, lt, lmo, ltg, lb, lua, lg, luo, lus, lvs, mag, mai, ml, mar, min, mk, mt, mni, mos, mi, my, nl, nn, nb, npi, nso, nus, ny, oc, ory, pag, pa, pap, pbt, pes, plt, pl, pt, prs, quy, ro, rn, ru, sg, sa, sat, scn, shn, si, sk, sl, sm, sn, sd, so, st, es, sc, sr, ss, su, sv, swh, szl, ta, taq, tt, te, tg, tl, th, ti, tpi, tn, ts, tk, tum, tr, tw, tzm, ug, uk, umb, ur, uzn, vec, vi, war, wo, xh, ydd, yo, yue, zh, zsm, zu
Language Details
ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab, aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab, asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl, bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn, bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn, cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn, dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn, ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn, fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr, hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn, hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn, jpn_Jpan, kab_Latn, kac_Latn, kam_Latn, kan_Knda, kas_Arab, kas_Deva, kat_Geor, knc_Arab, knc_Latn, kaz_Cyrl, kbp_Latn, kea_Latn, khm_Khmr, kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kon_Latn, kor_Hang, kmr_Latn, lao_Laoo, lvs_Latn, lij_Latn, lim_Latn, lin_Latn, lit_Latn, lmo_Latn, ltg_Latn, ltz_Latn, lua_Latn, lug_Latn, luo_Latn, lus_Latn, mag_Deva, mai_Deva, mal_Mlym, mar_Deva, min_Latn, mkd_Cyrl, plt_Latn, mlt_Latn, mni_Beng, khk_Cyrl, mos_Latn, mri_Latn, zsm_Latn, mya_Mymr, nld_Latn, nno_Latn, nob_Latn, npi_Deva, nso_Latn, nus_Latn, nya_Latn, oci_Latn, gaz_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn, pol_Latn, por_Latn, prs_Arab, pbt_Arab, quy_Latn, ron_Latn, run_Latn, rus_Cyrl, sag_Latn, san_Deva, sat_Beng, scn_Latn, shn_Mymr, sin_Sinh, slk_Latn, slv_Latn, smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn, spa_Latn, als_Latn, srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn, swh_Latn, szl_Latn, tam_Taml, tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn, tha_Thai, tir_Ethi, taq_Latn, taq_Tfng, tpi_Latn, tsn_Latn, tso_Latn, tuk_Latn, tum_Latn, tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl, umb_Latn, urd_Arab, uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn, xho_Latn, ydd_Hebr, yor_Latn, yue_Hant, zho_Hans, zho_Hant, zul_Latn
Tags
- ctranslate2
- int8
- float16
- nllb
- translation
Datasets
- flores-200
Metrics
- bleu
- spbleu
- chrf++
Inference
Inference is disabled.
License
This is only a quantized version of the original model. License conditions are intended to be identical to those of the original Hugging Face repo, which is released under the CC-BY-NC-4.0 license.
Documentation
Original Description of NLLB-200
This is the model card of NLLB-200's 3.3B variant. Here are the metrics for that particular checkpoint.
Information about the Model
- Information about training algorithms, parameters, fairness constraints or other applied approaches, and features. The exact training algorithm, data, and the strategies to handle data imbalances for high- and low-resource languages that were used to train NLLB-200 are described in the paper.
- Paper or other resource for more information: NLLB Team et al., No Language Left Behind: Scaling Human-Centered Machine Translation, arXiv, 2022
- License: CC-BY-NC
- Where to send questions or comments about the model: https://github.com/facebookresearch/fairseq/issues
Intended Use
- Primary intended uses: NLLB-200 is a machine translation model primarily intended for research in machine translation, especially for low-resource languages. It allows for single-sentence translation among 200 languages. Information on how to use the model can be found in the Fairseq code repository along with the training code and references to evaluation and training data.
- Primary intended users: Primary users are researchers and the machine translation research community.
- Out-of-scope use cases: NLLB-200 is a research model and is not released for production deployment. NLLB-200 is trained on general-domain text data and is not intended to be used with domain-specific texts, such as those in the medical or legal domains. The model is not intended to be used for document translation. The model was trained with input lengths not exceeding 512 tokens, so translating longer sequences might result in quality degradation. NLLB-200 translations cannot be used as certified translations.
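Given the 512-token input limit above, longer text is typically split into sentences before translation. A minimal, hypothetical sketch using naive punctuation-based splitting (a real pipeline would use a proper sentence segmenter):

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naively split text on sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

Each resulting sentence can then be translated on its own, keeping inputs well under the model's 512-token training limit.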
Metrics
- Model performance measures: The NLLB-200 model was evaluated using the BLEU, spBLEU, and chrF++ metrics widely adopted by the machine translation community. Additionally, human evaluation with the XSTS protocol was performed, and the toxicity of the generated translations was measured.
Evaluation Data
- Datasets: The Flores-200 dataset is described in Section 4.
- Motivation: Flores-200 was used as it provides full evaluation coverage of the languages in NLLB-200.
- Preprocessing: Sentence-split raw text data was preprocessed using SentencePiece. The SentencePiece model is released along with NLLB-200.
Training Data
- Parallel multilingual data from a variety of sources was used to train the model. A detailed report on data selection and construction process is provided in Section 5 in the paper. Monolingual data constructed from Common Crawl was also used, with more details provided in Section 5.2.
Ethical Considerations
- In this work, a reflexive approach in technological development was taken to ensure that human users are prioritized and risks transferred to them are minimized. While ethical considerations are reflected throughout the article, some additional points are highlighted. Many languages chosen for this study are low-resource languages, with a heavy emphasis on African languages. While quality translation could improve education and information access in many of these communities, such access could also make groups with lower levels of digital literacy more vulnerable to misinformation or online scams. These scenarios could arise if bad actors misappropriate the work for nefarious activities, which is considered an example of unintended use. Regarding data acquisition, the training data used for model development were mined from various publicly available sources on the web. Although significant effort was invested in data cleaning, personally identifiable information may not be entirely eliminated. Finally, although efforts were made to optimize for translation quality, mistranslations produced by the model could remain. Although the odds are low, this could have an adverse impact on those who rely on these translations to make important decisions (particularly when related to health and safety).
Caveats and Recommendations
- The model has been tested on the Wikimedia domain with limited investigation on other domains supported in NLLB-MD. In addition, the supported languages may have variations that the model is not capturing. Users should make appropriate assessments.
Carbon Footprint Details
- The carbon dioxide equivalent (CO2e) estimate is reported in Section 8.8.

