🚀 NLLB-200
NLLB-200 is a machine translation model intended for machine translation research, especially for low-resource languages. It enables single-sentence translation among 200 languages.
🚀 Quick Start
The information on how to use the model can be found in the Fairseq code repository, along with the training code and references to evaluation and training data.
✨ Features
- Multilingual Translation: Capable of translating single sentences among 200 languages.
- Research-Oriented: Primarily designed for machine translation research, especially for low-resource languages.
📦 Installation
No installation steps are provided in the original README, so this section is skipped.
💻 Usage Examples
No code examples are provided in the original README; an illustrative sketch using the Hugging Face transformers API is included below.
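As an illustration only (the original card defers usage details to the Fairseq repository), the following sketch shows single-sentence translation through the Hugging Face `transformers` API. The checkpoint id `facebook/nllb-200-distilled-1.3B`, the source code `eng_Latn`, and the target code `fra_Latn` are assumptions made for this example, not values taken from the original README.

```python
# Hedged sketch: single-sentence translation with the transformers API.
# Checkpoint id and language codes are assumptions for illustration.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-1.3B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "NLLB-200 enables single-sentence translation among 200 languages."
inputs = tokenizer(text, return_tensors="pt")

# The target language is selected by forcing its code as the first decoder token.
translated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_length=512,  # inputs longer than 512 tokens were not seen in training
)
print(tokenizer.batch_decode(translated, skip_special_tokens=True)[0])
```

The source language is passed via `src_lang` at tokenization time; keeping inputs to single sentences matches the model's intended use.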
📚 Documentation
- Metrics: Metrics for this particular checkpoint are reported. The NLLB-200 model was evaluated using the BLEU, spBLEU, and chrF++ metrics, which are widely adopted by the machine translation community. Additionally, human evaluation was performed with the XSTS protocol, and the toxicity of the generated translations was measured (a scoring sketch follows this list).
- Intended Use
- Primary Uses: Intended for research in machine translation, especially for low-resource languages.
- Primary Users: Researchers and the machine translation research community.
- Out-of-Scope Use Cases: NLLB-200 is a research model and is not released for production deployment. It is trained on general-domain text data and is not suitable for domain-specific texts (e.g., medical or legal). It is not intended for document translation. Translating sequences longer than 512 tokens may degrade quality, and its translations cannot be used as certified translations.
- Evaluation Data
- Datasets: The Flores-200 dataset is used, described in Section 4.
- Motivation: It provides full evaluation coverage of the languages in NLLB-200.
- Preprocessing: Sentence-split raw text data was preprocessed using SentencePiece; the SentencePiece model is released with NLLB-200 (see the sketch after this list).
- Training Data
- Parallel multilingual data from various sources was used. Details on data selection and construction are in Section 5 of the paper. Monolingual data constructed from Common Crawl was also used, with more details in Section 5.2.
- Ethical Considerations
- A reflexive approach was taken in technological development to prioritize human users and minimize risks. Many chosen languages are low-resource, especially African languages. Quality translation can improve education and information access, but it may also make less digitally literate groups vulnerable to misinformation or scams. The training data was mined from public web sources, and although data cleaning was done, personally identifiable information may not be fully eliminated. Mistranslations may still occur and could have adverse impacts on decision-making.
- Caveats and Recommendations
- The model has been tested on the Wikimedia domain with limited investigation on other domains supported in NLLB-MD. Supported languages may have variations that the model does not capture, and users should make appropriate assessments.
- Carbon Footprint Details
- The carbon dioxide equivalent (CO2e) estimate is reported in Section 8.8.
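As a hedged illustration of the evaluation setup described above (SentencePiece preprocessing plus BLEU, spBLEU, and chrF++ scoring), the sketch below uses the `sacrebleu` and `sentencepiece` packages. The example sentences and the SentencePiece model path are placeholders, and the exact tokenizer option used for spBLEU depends on the sacrebleu version; none of these details come from the original card.

```python
# Hedged sketch: scoring translations with sacrebleu; sentences and the
# SentencePiece model path are placeholders, not values from the original card.
import sacrebleu
import sentencepiece as spm

hypotheses = ["Le chat est assis sur le tapis."]    # system outputs (placeholder)
references = [["Le chat est assis sur le tapis."]]  # one reference stream (placeholder)

# Corpus-level BLEU and chrF++ (word_order=2 turns chrF into chrF++).
bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf_pp = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)
print(f"BLEU: {bleu.score:.2f}  chrF++: {chrf_pp.score:.2f}")

# spBLEU is BLEU computed on SentencePiece-tokenized text; recent sacrebleu
# releases expose an SPM-based tokenize option (its name varies by version).
# The SentencePiece model released with NLLB-200 can also be applied directly:
sp = spm.SentencePieceProcessor(model_file="path/to/nllb200_spm.model")  # placeholder path
print(sp.encode(hypotheses[0], out_type=str)[:10])
```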
🔧 Technical Details
The exact training algorithm, the data, and the strategies used to handle data imbalances for high- and low-resource languages in training NLLB-200 are described in the paper: NLLB Team et al., No Language Left Behind: Scaling Human-Centered Machine Translation, arXiv, 2022.
📄 License
The license for this model is CC-BY-NC-4.0.
Information Table
| Property | Details |
|----------|---------|
| Model Type | NLLB-200's distilled 1.3B variant |
| Training Data | Parallel multilingual data from various sources and monolingual data from Common Crawl |
| Datasets | Flores-200 |
| Metrics | BLEU, spBLEU, chrF++ |
| License | CC-BY-NC-4.0 |
Important Notes
⚠️ Important Note
NLLB-200 is a research model and is not released for production deployment. It is trained on general-domain text data and is not intended for use with domain-specific texts, such as medical or legal documents. The model is not intended for document translation. It was trained with input lengths not exceeding 512 tokens, so translating longer sequences may degrade quality. NLLB-200 translations cannot be used as certified translations.
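Because the 512-token limit refers to tokenized length rather than characters, a quick length check before translating can help. The sketch below is a non-authoritative example and the checkpoint id is an assumption.

```python
# Hedged sketch: check tokenized input length against the 512-token training limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-1.3B")  # assumed id

text = "A long passage that should be split into single sentences before translation."
n_tokens = len(tokenizer(text)["input_ids"])
if n_tokens > 512:
    print(f"Input is {n_tokens} tokens; split it into sentences before translating.")
else:
    print(f"Input is {n_tokens} tokens; within the 512-token limit.")
```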
💡 Usage Tip
Our model has been tested on the Wikimedia domain with limited investigation on other domains supported in NLLB-MD. In addition, the supported languages may have variations that our model does not capture. Users should make appropriate assessments.