NLLB-200
NLLB-200 is a machine translation model that enables single-sentence translation across 200 languages, intended primarily for machine translation research, especially for low-resource languages.
Model Information
| Property | Details |
|----------|---------|
| Base Model | facebook/nllb-200-distilled-600M |
| Pipeline Tag | translation |
| Tags | nllb |
| License | cc-by-nc-4.0 |
| Datasets | flores-200 |
| Metrics | bleu, spbleu, chrf++ |
| Inference | false |
Language Support
The model supports the following languages:
ace, acm, acq, aeb, af, ajp, ak, als, am, apc, ar, ars, ary, arz, as, ast, awa, ayr, azb, azj, ba, bm, ban, be, bem, bn, bho, bjn, bo, bs, bug, bg, ca, ceb, cs, cjk, ckb, crh, cy, da, de, dik, dyu, dz, el, en, eo, et, eu, ee, fo, fj, fi, fon, fr, fur, fuv, gaz, gd, ga, gl, gn, gu, ht, ha, he, hi, hne, hr, hu, hy, ig, ilo, id, is, it, jv, ja, kab, kac, kam, kn, ks, ka, kk, kbp, kea, khk, km, ki, rw, ky, kmb, kmr, knc, kg, ko, lo, lij, li, ln, lt, lmo, ltg, lb, lua, lg, luo, lus, lvs, mag, mai, ml, mar, min, mk, mt, mni, mos, mi, my, nl, nn, nb, npi, nso, nus, ny, oc, ory, pag, pa, pap, pbt, pes, plt, pl, pt, prs, quy, ro, rn, ru, sg, sa, sat, scn, shn, si, sk, sl, sm, sn, sd, so, st, es, sc, sr, ss, su, sv, swh, szl, ta, taq, tt, te, tg, tl, th, ti, tpi, tn, ts, tk, tum, tr, tw, tzm, ug, uk, umb, ur, uzn, vec, vi, war, wo, xh, ydd, yo, yue, zh, zsm, zu
Language details:
ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab, aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab, asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl, bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn, bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn, cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn, dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn, ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn, fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr, hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn, hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn, jpn_Jpan, kab_Latn, kac_Latn, kam_Latn, kan_Knda, kas_Arab, kas_Deva, kat_Geor, knc_Arab, knc_Latn, kaz_Cyrl, kbp_Latn, kea_Latn, khm_Khmr, kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kon_Latn, kor_Hang, kmr_Latn, lao_Laoo, lvs_Latn, lij_Latn, lim_Latn, lin_Latn, lit_Latn, lmo_Latn, ltg_Latn, ltz_Latn, lua_Latn, lug_Latn, luo_Latn, lus_Latn, mag_Deva, mai_Deva, mal_Mlym, mar_Deva, min_Latn, mkd_Cyrl, plt_Latn, mlt_Latn, mni_Beng, khk_Cyrl, mos_Latn, mri_Latn, zsm_Latn, mya_Mymr, nld_Latn, nno_Latn, nob_Latn, npi_Deva, nso_Latn, nus_Latn, nya_Latn, oci_Latn, gaz_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn, pol_Latn, por_Latn, prs_Arab, pbt_Arab, quy_Latn, ron_Latn, run_Latn, rus_Cyrl, sag_Latn, san_Deva, sat_Beng, scn_Latn, shn_Mymr, sin_Sinh, slk_Latn, slv_Latn, smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn, spa_Latn, als_Latn, srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn, swh_Latn, szl_Latn, tam_Taml, tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn, tha_Thai, tir_Ethi, taq_Latn, taq_Tfng, tpi_Latn, tsn_Latn, tso_Latn, tuk_Latn, tum_Latn, tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl, umb_Latn, urd_Arab, uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn, xho_Latn, ydd_Hebr, yor_Latn, yue_Hant, zho_Hans, zho_Hant, zul_Latn
Quick Start
Information on how to use the model can be found in the Fairseq code repository, along with the training code, references to the evaluation and training data, and metrics for this particular checkpoint.
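A minimal usage sketch with the Hugging Face transformers library (assuming the transformers and sentencepiece packages are installed; the source and target language codes come from the list above):

```python
# Minimal translation sketch for the distilled 600M checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Translate one sentence from English to French (FLORES-200 codes, see above).
inputs = tokenizer("The model enables single-sentence translation.", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    # Force the decoder to start with the target-language tag.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_length=512,  # the model is not trained for sequences longer than 512 tokens
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```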
Features
- Enables single-sentence translation among 200 languages.
- Evaluated with BLEU, spBLEU, and chrF++, metrics widely adopted by the machine translation community (a scoring sketch follows below). In addition, human evaluation was performed with the XSTS protocol, and the toxicity of the generated translations was measured.
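As an illustrative sketch of how such scores can be computed with the sacrebleu library (the hypothesis and reference strings here are placeholders, and the `flores200` tokenizer used for spBLEU requires a recent sacrebleu release):

```python
# Scoring sketch with sacrebleu (pip install sacrebleu).
from sacrebleu.metrics import BLEU, CHRF

hyps = ["La NLLB traduit une phrase à la fois."]        # system outputs (placeholder)
refs = [["NLLB traduit une seule phrase à la fois."]]   # references (placeholder)

bleu = BLEU()                        # standard BLEU
spbleu = BLEU(tokenize="flores200")  # spBLEU: BLEU over the FLORES-200 SentencePiece tokenization
chrfpp = CHRF(word_order=2)          # chrF++ = chrF extended with word bigrams

print(bleu.corpus_score(hyps, refs))
print(spbleu.corpus_score(hyps, refs))
print(chrfpp.corpus_score(hyps, refs))
```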
Documentation
Intended Use
- Primary intended uses: NLLB-200 is intended primarily for machine translation research, especially for low-resource languages.
- Primary intended users: Researchers and the machine translation research community.
- Out-of-scope use cases: NLLB-200 is a research model and is not intended for production deployment. It is trained on general-domain text and is not suitable for domain-specific texts (e.g., medical or legal documents) or for document translation. Translating sequences longer than 512 tokens may degrade quality (see the length-check sketch below), and its output cannot be used as certified translation.
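Because quality may degrade beyond 512 tokens, a caller may want to guard input length before generating; a minimal sketch, reusing the tokenizer from the Quick Start example:

```python
# Guard against over-long inputs (the 512-token limit noted above).
MAX_TOKENS = 512

def fits_model(tokenizer, text: str) -> bool:
    """Return True if `text` fits within the model's trained sequence length."""
    n_tokens = len(tokenizer(text)["input_ids"])
    return n_tokens <= MAX_TOKENS

# Usage (tokenizer from the Quick Start sketch):
# if not fits_model(tokenizer, sentence):
#     ...split the input into shorter sentences before translating...
```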
Evaluation Data
- Datasets: The Flores-200 dataset is used, as described in Section 4 of the paper.
- Motivation: It provides full evaluation coverage of the languages in NLLB-200.
- Preprocessing: Sentence-split raw text data was preprocessed with SentencePiece; the SentencePiece model is released together with NLLB-200 (see the encoding sketch below).
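An illustrative sketch of that preprocessing step with the sentencepiece package; the model filename below matches the one shipped with the FLORES-200 distribution, but treat the exact path as an assumption:

```python
# Preprocessing sketch: encode sentence-split text with the released SentencePiece model.
# Adjust the path to wherever the SentencePiece model was downloaded.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="flores200_sacrebleu_tokenizer_spm.model")

sentence = "NLLB translates a single sentence at a time."
pieces = sp.encode(sentence, out_type=str)  # subword pieces
ids = sp.encode(sentence)                   # integer ids
print(pieces)
```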
Training Data
Parallel multilingual data from a variety of sources were used to train the model; a detailed report on data selection and construction is given in Section 5 of the paper. Monolingual data constructed from Common Crawl were also used, with more details in Section 5.2.
Ethical Considerations
- Many of the chosen languages are low-resource, primarily African languages. While the model can improve education and information access, it may also make less digitally literate groups more vulnerable to misinformation or scams.
- Training data were mined from public web sources. Despite data cleaning, personally identifiable information may remain.
- Although translation quality was optimized, mistranslations can still occur and may adversely affect decision-making.
Caveats and Recommendations
The model has been tested on the Wikimedia domain, with limited investigation of the other domains supported in NLLB-MD. In addition, the supported languages may have variations that the model does not capture; users should make appropriate assessments for their use case.
Carbon Footprint Details
The carbon dioxide equivalent (CO2e) estimate is reported in Section 8.8 of the paper.
License
The model is licensed under CC-BY-NC 4.0. Questions or comments about the model can be filed at https://github.com/facebookresearch/fairseq/issues.
References
NLLB Team et al., "No Language Left Behind: Scaling Human-Centered Machine Translation," arXiv, 2022.