NLLB-200 3.3B CT2 int8
Model Overview
Model Features
Model Capabilities
Use Cases
Fast Inference with CTranslate2
This project speeds up inference and reduces memory usage by 2x to 4x using int8 inference in C++ on CPU or GPU. It is a quantized version of facebook/nllb-200-3.3B.
Quick Start
Installation
```shell
pip install ctranslate2
```
Compatibility
This checkpoint is compatible with `ctranslate2>=3.22.0`.
- Use `compute_type=int8_float16` for `device="cuda"`.
- Use `compute_type=int8` for `device="cpu"`.
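Putting those compatibility notes together, a minimal sketch of loading the converted model with the compute type matching the device (the model directory name is a placeholder, and the `ctranslate2` import is deferred so the helpers load even without the package installed):

```python
def compute_type_for(device: str) -> str:
    """Return the quantization mode recommended above for each device."""
    return "int8_float16" if device == "cuda" else "int8"


def load_translator(model_dir: str, device: str = "cpu"):
    """Load the converted model directory with the matching compute type."""
    import ctranslate2  # deferred import; requires `pip install ctranslate2`

    return ctranslate2.Translator(
        model_dir,
        device=device,
        compute_type=compute_type_for(device),
    )
```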
Conversion
The conversion was done on 2023-12-01 using `CTranslate2==3.22.0`:
```python
from ctranslate2.converters import TransformersConverter

TransformersConverter(
    "facebook/nllb-200-3.3B",
    activation_scales=None,
    copy_files=[
        "tokenizer.json",
        "generation_config.json",
        "README.md",
        "special_tokens_map.json",
        "tokenizer_config.json",
        ".gitattributes",
    ],
    load_as_float16=True,
    revision=None,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).convert(
    output_dir=str(tmp_dir),  # tmp_dir: destination directory for the converted model
    vmap=None,
    quantization="int8",
    force=True,
)
```
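Once converted, the model can be used for translation with the usual CTranslate2 + transformers pattern. A sketch under stated assumptions: the model path, source language, and target language below are illustrative, and the imports are deferred so the function loads without the packages installed:

```python
def translate(
    text: str,
    model_dir: str,
    src_lang: str = "eng_Latn",
    tgt_lang: str = "fra_Latn",
) -> str:
    """Translate one sentence with a converted NLLB-200 model (sketch)."""
    import ctranslate2  # deferred; requires `pip install ctranslate2`
    import transformers  # deferred; requires `pip install transformers`

    tokenizer = transformers.AutoTokenizer.from_pretrained(model_dir, src_lang=src_lang)
    translator = ctranslate2.Translator(model_dir, device="cpu", compute_type="int8")

    # CTranslate2 consumes token strings; NLLB expects the target language
    # code as a prefix token on the decoder side.
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    results = translator.translate_batch([source], target_prefix=[[tgt_lang]])
    target = results[0].hypotheses[0][1:]  # drop the target-language prefix token
    return tokenizer.decode(tokenizer.convert_tokens_to_ids(target))
```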
Features
Supported Languages
- ace, acm, acq, aeb, af, ajp, ak, als, am, apc, ar, ars, ary, arz, as, ast, awa, ayr, azb, azj, ba, bm, ban, be, bem, bn, bho, bjn, bo, bs, bug, bg, ca, ceb, cs, cjk, ckb, crh, cy, da, de, dik, dyu, dz, el, en, eo, et, eu, ee, fo, fj, fi, fon, fr, fur, fuv, gaz, gd, ga, gl, gn, gu, ht, ha, he, hi, hne, hr, hu, hy, ig, ilo, id, is, it, jv, ja, kab, kac, kam, kn, ks, ka, kk, kbp, kea, khk, km, ki, rw, ky, kmb, kmr, knc, kg, ko, lo, lij, li, ln, lt, lmo, ltg, lb, lua, lg, luo, lus, lvs, mag, mai, ml, mar, min, mk, mt, mni, mos, mi, my, nl, nn, nb, npi, nso, nus, ny, oc, ory, pag, pa, pap, pbt, pes, plt, pl, pt, prs, quy, ro, rn, ru, sg, sa, sat, scn, shn, si, sk, sl, sm, sn, sd, so, st, es, sc, sr, ss, su, sv, swh, szl, ta, taq, tt, te, tg, tl, th, ti, tpi, tn, ts, tk, tum, tr, tw, tzm, ug, uk, umb, ur, uzn, vec, vi, war, wo, xh, ydd, yo, yue, zh, zsm, zu
Language Details
ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab, aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab, asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl, bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn, bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn, cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn, dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn, ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn, fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr, hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn, hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn, jpn_Jpan, kab_Latn, kac_Latn, kam_Latn, kan_Knda, kas_Arab, kas_Deva, kat_Geor, knc_Arab, knc_Latn, kaz_Cyrl, kbp_Latn, kea_Latn, khm_Khmr, kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kon_Latn, kor_Hang, kmr_Latn, lao_Laoo, lvs_Latn, lij_Latn, lim_Latn, lin_Latn, lit_Latn, lmo_Latn, ltg_Latn, ltz_Latn, lua_Latn, lug_Latn, luo_Latn, lus_Latn, mag_Deva, mai_Deva, mal_Mlym, mar_Deva, min_Latn, mkd_Cyrl, plt_Latn, mlt_Latn, mni_Beng, khk_Cyrl, mos_Latn, mri_Latn, zsm_Latn, mya_Mymr, nld_Latn, nno_Latn, nob_Latn, npi_Deva, nso_Latn, nus_Latn, nya_Latn, oci_Latn, gaz_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn, pol_Latn, por_Latn, prs_Arab, pbt_Arab, quy_Latn, ron_Latn, run_Latn, rus_Cyrl, sag_Latn, san_Deva, sat_Beng, scn_Latn, shn_Mymr, sin_Sinh, slk_Latn, slv_Latn, smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn, spa_Latn, als_Latn, srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn, swh_Latn, szl_Latn, tam_Taml, tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn, tha_Thai, tir_Ethi, taq_Latn, taq_Tfng, tpi_Latn, tsn_Latn, tso_Latn, tuk_Latn, tum_Latn, tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl, umb_Latn, urd_Arab, uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn, xho_Latn, ydd_Hebr, yor_Latn, yue_Hant, zho_Hans, zho_Hant, zul_Latn
Tags
- ctranslate2
- int8
- float16
- nllb
- translation
Datasets
- flores-200
Metrics
- bleu
- spbleu
- chrf++
Inference
Inference is disabled.
License
This is only a quantized version of the original model. License conditions are intended to be identical to those of the original Hugging Face repo, which is released under the CC-BY-NC-4.0 license.
Documentation
Original Description of NLLB-200
This is the model card of NLLB-200's 3.3B variant. Here are the metrics for that particular checkpoint.
Information about the Model
- Information about training algorithms, parameters, fairness constraints or other applied approaches, and features. The exact training algorithm, data, and the strategies to handle data imbalances for high- and low-resource languages that were used to train NLLB-200 are described in the paper.
- Paper or other resource for more information: NLLB Team et al., No Language Left Behind: Scaling Human-Centered Machine Translation, arXiv, 2022
- License: CC-BY-NC
- Where to send questions or comments about the model: https://github.com/facebookresearch/fairseq/issues
Intended Use
- Primary intended uses: NLLB-200 is a machine translation model primarily intended for research in machine translation, especially for low-resource languages. It allows for single-sentence translation among 200 languages. Information on how to use the model can be found in the Fairseq code repository along with the training code and references to evaluation and training data.
- Primary intended users: Primary users are researchers and the machine translation research community.
- Out-of-scope use cases: NLLB-200 is a research model and is not released for production deployment. NLLB-200 is trained on general-domain text data and is not intended to be used with domain-specific texts, such as those in the medical or legal domains. The model is not intended to be used for document translation. The model was trained with input lengths not exceeding 512 tokens, so translating longer sequences might result in quality degradation. NLLB-200 translations cannot be used as certified translations.
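Given the 512-token input limit above, longer text is typically split into sentences before translation. A minimal, hypothetical sketch using naive punctuation-based splitting (a real pipeline would use a proper sentence segmenter):

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naively split text on sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

Each resulting sentence can then be translated on its own, keeping inputs well under the model's 512-token training limit.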
Metrics
- Model performance measures: The NLLB-200 model was evaluated using the BLEU, spBLEU, and chrF++ metrics widely adopted by the machine translation community. Additionally, human evaluation with the XSTS protocol was performed, and the toxicity of the generated translations was measured.
Evaluation Data
- Datasets: The Flores-200 dataset is described in Section 4.
- Motivation: Flores-200 was used as it provides full evaluation coverage of the languages in NLLB-200.
- Preprocessing: Sentence-split raw text data was preprocessed using SentencePiece. The SentencePiece model is released along with NLLB-200.
Training Data
- Parallel multilingual data from a variety of sources was used to train the model. A detailed report on data selection and construction process is provided in Section 5 in the paper. Monolingual data constructed from Common Crawl was also used, with more details provided in Section 5.2.
Ethical Considerations
- In this work, a reflexive approach in technological development was taken to ensure that human users are prioritized and risks transferred to them are minimized. While ethical considerations are reflected throughout the article, some additional points are highlighted. Many languages chosen for this study are low-resource languages, with a heavy emphasis on African languages. While quality translation could improve education and information access in many of these communities, such access could also make groups with lower levels of digital literacy more vulnerable to misinformation or online scams. These scenarios could arise if bad actors misappropriate the work for nefarious activities, which is considered an example of unintended use. Regarding data acquisition, the training data used for model development were mined from various publicly available sources on the web. Although significant effort was invested in data cleaning, personally identifiable information may not be entirely eliminated. Finally, although efforts were made to optimize for translation quality, mistranslations produced by the model could remain. Although the odds are low, this could have an adverse impact on those who rely on these translations to make important decisions (particularly when related to health and safety).
Caveats and Recommendations
- The model has been tested on the Wikimedia domain with limited investigation on other domains supported in NLLB-MD. In addition, the supported languages may have variations that the model is not capturing. Users should make appropriate assessments.
Carbon Footprint Details
- The carbon dioxide equivalent (CO2e) estimate is reported in Section 8.8.

