🚀 NLLB-200
NLLB-200 is a machine translation model intended for machine translation research, especially for low-resource languages. It enables single-sentence translation among 200 languages.
🚀 Quick Start
The information on how to use the model can be found in the Fairseq code repository, along with the training code and references to evaluation and training data.
✨ Features
- Multilingual Translation: Capable of translating single sentences among 200 languages.
- Research-Oriented: Primarily designed for machine translation research, especially for low-resource languages.
📦 Installation
No installation steps are provided in the original README, so this section is skipped.
💻 Usage Examples
No code examples are provided in the original README; an illustrative sketch using the Hugging Face transformers API is included below.
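As an illustration only (the original card defers usage details to the Fairseq repository), the following sketch shows single-sentence translation through the Hugging Face `transformers` API. The checkpoint id `facebook/nllb-200-distilled-1.3B`, the source code `eng_Latn`, and the target code `fra_Latn` are assumptions made for this example, not values taken from the original README.

```python
# Hedged sketch: single-sentence translation with the transformers API.
# Checkpoint id and language codes are assumptions for illustration.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-1.3B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "NLLB-200 enables single-sentence translation among 200 languages."
inputs = tokenizer(text, return_tensors="pt")

# The target language is selected by forcing its code as the first decoder token.
translated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_length=512,  # inputs longer than 512 tokens were not seen in training
)
print(tokenizer.batch_decode(translated, skip_special_tokens=True)[0])
```

The source language is passed via `src_lang` at tokenization time; keeping inputs to single sentences matches the model's intended use.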
📚 Documentation
- Metrics: Metrics for this particular checkpoint are reported. The NLLB-200 model was evaluated using the BLEU, spBLEU, and chrF++ metrics, which are widely adopted by the machine translation community. Additionally, human evaluation was performed with the XSTS protocol, and the toxicity of the generated translations was measured (a scoring sketch follows this list).
- Intended Use
- Primary Uses: Intended for research in machine translation, especially for low-resource languages.
- Primary Users: Researchers and the machine translation research community.
- Out-of-Scope Use Cases: NLLB-200 is a research model and is not released for production deployment. It is trained on general-domain text data and is not suitable for domain-specific texts (e.g., medical or legal). It is not intended for document translation. Translating sequences longer than 512 tokens may degrade quality, and its translations cannot be used as certified translations.
- Evaluation Data
- Datasets: The Flores-200 dataset is used, described in Section 4.
- Motivation: It provides full evaluation coverage of the languages in NLLB-200.
- Preprocessing: Sentence-split raw text data was preprocessed using SentencePiece; the SentencePiece model is released with NLLB-200 (see the sketch after this list).
- Training Data
- Parallel multilingual data from various sources was used. Details on data selection and construction are in Section 5 of the paper. Monolingual data constructed from Common Crawl was also used, with more details in Section 5.2.
- Ethical Considerations
- A reflexive approach was taken in technological development to prioritize human users and minimize risks. Many chosen languages are low-resource, especially African languages. Quality translation can improve education and information access, but it may also make less digitally literate groups vulnerable to misinformation or scams. The training data was mined from public web sources, and although data cleaning was done, personally identifiable information may not be fully eliminated. Mistranslations may still occur and could have adverse impacts on decision-making.
- Caveats and Recommendations
- The model has been tested on the Wikimedia domain with limited investigation on other domains supported in NLLB-MD. Supported languages may have variations that the model does not capture, and users should make appropriate assessments.
- Carbon Footprint Details
- The carbon dioxide equivalent (CO2e) estimate is reported in Section 8.8.
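As a hedged illustration of the evaluation setup described above (SentencePiece preprocessing plus BLEU, spBLEU, and chrF++ scoring), the sketch below uses the `sacrebleu` and `sentencepiece` packages. The example sentences and the SentencePiece model path are placeholders, and the exact tokenizer option used for spBLEU depends on the sacrebleu version; none of these details come from the original card.

```python
# Hedged sketch: scoring translations with sacrebleu; sentences and the
# SentencePiece model path are placeholders, not values from the original card.
import sacrebleu
import sentencepiece as spm

hypotheses = ["Le chat est assis sur le tapis."]    # system outputs (placeholder)
references = [["Le chat est assis sur le tapis."]]  # one reference stream (placeholder)

# Corpus-level BLEU and chrF++ (word_order=2 turns chrF into chrF++).
bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf_pp = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)
print(f"BLEU: {bleu.score:.2f}  chrF++: {chrf_pp.score:.2f}")

# spBLEU is BLEU computed on SentencePiece-tokenized text; recent sacrebleu
# releases expose an SPM-based tokenize option (its name varies by version).
# The SentencePiece model released with NLLB-200 can also be applied directly:
sp = spm.SentencePieceProcessor(model_file="path/to/nllb200_spm.model")  # placeholder path
print(sp.encode(hypotheses[0], out_type=str)[:10])
```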
🔧 Technical Details
The exact training algorithm, the data, and the strategies used to handle data imbalances for high- and low-resource languages in training NLLB-200 are described in the paper: NLLB Team et al., No Language Left Behind: Scaling Human-Centered Machine Translation, arXiv, 2022.
📄 License
The license for this model is CC-BY-NC-4.0.
Information Table
| Property | Details |
|----------|---------|
| Model Type | NLLB-200's distilled 1.3B variant |
| Training Data | Parallel multilingual data from various sources and monolingual data from Common Crawl |
| Datasets | Flores-200 |
| Metrics | BLEU, spBLEU, chrF++ |
| License | CC-BY-NC-4.0 |
Important Notes
⚠️ Important Note
NLLB-200 is a research model and is not released for production deployment. It is trained on general-domain text data and is not intended for use with domain-specific texts, such as medical or legal documents. The model is not intended for document translation. It was trained with input lengths not exceeding 512 tokens, so translating longer sequences may degrade quality. NLLB-200 translations cannot be used as certified translations.
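Because the 512-token limit refers to tokenized length rather than characters, a quick length check before translating can help. The sketch below is a non-authoritative example and the checkpoint id is an assumption.

```python
# Hedged sketch: check tokenized input length against the 512-token training limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-1.3B")  # assumed id

text = "A long passage that should be split into single sentences before translation."
n_tokens = len(tokenizer(text)["input_ids"])
if n_tokens > 512:
    print(f"Input is {n_tokens} tokens; split it into sentences before translating.")
else:
    print(f"Input is {n_tokens} tokens; within the 512-token limit.")
```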
💡 Usage Tip
Our model has been tested on the Wikimedia domain with limited investigation on other domains supported in NLLB-MD. In addition, the supported languages may have variations that our model does not capture. Users should make appropriate assessments.