# NLLB-200
This is a machine translation model that supports single-sentence translation among 200 languages. It is intended mainly for research purposes, especially in the field of low-resource languages.
## 🚀 Quick Start
This model can be downloaded and run using Python. For detailed steps, please refer to the following sections.
## ✨ Features
- Supports single-sentence translation among 200 languages.
- Evaluated using multiple metrics such as BLEU, spBLEU, and chrF++.
- Benchmarked on the Flores-200 dataset.
## 📦 Installation
### Download the model
- Install Python: https://www.python.org/downloads/
- Open the command prompt (`cmd`)
- Check the Python version: `python --version`
- Install the `huggingface_hub` library: `python -m pip install huggingface_hub`
- Run Python and execute the following code:

```python
import huggingface_hub

huggingface_hub.snapshot_download('entai2965/nllb-200-3.3B-ctranslate2', local_dir='nllb-200-3.3B-ctranslate2')
```
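Alternatively, recent versions of `huggingface_hub` install a `huggingface-cli` entry point, so the same snapshot can be fetched from the command prompt (an optional shortcut, not required; it uses the same repository id as above):

```bash
huggingface-cli download entai2965/nllb-200-3.3B-ctranslate2 --local-dir nllb-200-3.3B-ctranslate2
```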
### Install dependencies for running the model
- Open the command prompt (`cmd`)
- Install the `ctranslate2` and `transformers` libraries: `python -m pip install ctranslate2 transformers`
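To confirm both libraries are installed correctly, you can print their versions (a quick sanity check, nothing model-specific):

```bash
python -c "import ctranslate2, transformers; print(ctranslate2.__version__, transformers.__version__)"
```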
## 💻 Usage Examples
### Basic Usage
```python
import ctranslate2
import transformers

src_lang = "eng_Latn"
tgt_lang = "fra_Latn"

translator = ctranslate2.Translator("nllb-200-3.3B-ctranslate2", device='cpu')
tokenizer = transformers.AutoTokenizer.from_pretrained("nllb-200-3.3B-ctranslate2", src_lang=src_lang, clean_up_tokenization_spaces=True)

# Tokenize the input and translate with the target language code as the prefix.
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world!"))
target_prefix = [tgt_lang]
results = translator.translate_batch([source], target_prefix=[target_prefix])

# The first output token is the target language code, so skip it before decoding.
target = results[0].hypotheses[0][1:]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
```
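`translate_batch` also accepts standard CTranslate2 decoding options such as `beam_size` and `num_hypotheses`. A minimal sketch that reuses the `translator` and `tokenizer` from the example above to print several candidate translations (the option values shown are illustrative):

```python
# Run 4-way beam search and keep all four hypotheses (CTranslate2 decoding options).
results = translator.translate_batch(
    [source],
    target_prefix=[[tgt_lang]],
    beam_size=4,
    num_hypotheses=4,
)

# Each hypothesis starts with the target language token, so skip it before decoding.
for hypothesis in results[0].hypotheses:
    print(tokenizer.decode(tokenizer.convert_tokens_to_ids(hypothesis[1:])))
```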
### Advanced Usage
#### Batch translation
```python
import os
import ctranslate2
import transformers

# set defaults
home_path = os.path.expanduser('~')
model_folder = home_path + '/Downloads/models/nllb-200-3.3B-ctranslate2'  # 13 GB of memory
string1 = 'Hello world!'
string2 = 'Awesome.'
raw_list = [string1, string2]

# https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200
source_language_code = "eng_Latn"
target_language_code = "fra_Latn"
device = 'cpu'
# device = 'cuda'

# load models
translator = ctranslate2.Translator(model_folder, device=device)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_folder, src_lang=source_language_code, clean_up_tokenization_spaces=True)

# tokenize input
encoded_list = []
for text in raw_list:
    encoded_list.append(tokenizer.convert_ids_to_tokens(tokenizer.encode(text)))

# translate
# https://opennmt.net/CTranslate2/python/ctranslate2.Translator.html?#ctranslate2.Translator.translate_batch
translated_list = translator.translate_batch(encoded_list, target_prefix=[[target_language_code]] * len(raw_list))
assert (len(raw_list) == len(translated_list))

# decode
for counter, tokens in enumerate(translated_list):
    translated_list[counter] = tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens.hypotheses[0][1:]))

# output
for text in translated_list:
    print(text)
```
#### Functional programming version
```python
import os
import ctranslate2
import transformers

# set defaults
home_path = os.path.expanduser('~')
model_folder = home_path + '/Downloads/models/nllb-200-3.3B-ctranslate2'  # 13 GB of memory
string1 = 'Hello world!'
string2 = 'Awesome.'
raw_list = [string1, string2]

# https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200
source_language_code = "eng_Latn"
target_language_code = "fra_Latn"
device = 'cpu'
# device = 'cuda'

# load models
translator = ctranslate2.Translator(model_folder, device=device)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_folder, src_lang=source_language_code, clean_up_tokenization_spaces=True)

# invoke black magic
translated_list = [
    tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens.hypotheses[0][1:]))
    for tokens in translator.translate_batch(
        [tokenizer.convert_ids_to_tokens(tokenizer.encode(text)) for text in raw_list],
        target_prefix=[[target_language_code]] * len(raw_list),
    )
]
assert (len(raw_list) == len(translated_list))

# output
for text in translated_list:
    print(text)
```
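The same steps can also be wrapped into a reusable helper. This is a convenience sketch (the function name and signature are illustrative, not part of the model or either library); it reuses the `translator`, `tokenizer`, `raw_list`, and `target_language_code` defined above:

```python
def translate(texts, translator, tokenizer, target_language_code):
    """Translate a list of strings with an NLLB CTranslate2 translator.

    The tokenizer must have been loaded with the desired src_lang.
    """
    encoded = [tokenizer.convert_ids_to_tokens(tokenizer.encode(text)) for text in texts]
    results = translator.translate_batch(encoded, target_prefix=[[target_language_code]] * len(texts))
    # The first output token is the target language code, so drop it before decoding.
    return [tokenizer.decode(tokenizer.convert_tokens_to_ids(r.hypotheses[0][1:])) for r in results]

print(translate(raw_list, translator, tokenizer, target_language_code))
```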
## 📚 Documentation
### Intended Use
- Primary intended uses: NLLB-200 is a machine translation model primarily intended for research in machine translation, especially for low-resource languages. It allows for single-sentence translation among 200 languages. Information on how to use the model can be found in the Fairseq code repository along with the training code and references to evaluation and training data.
- Primary intended users: The primary users are researchers and the machine translation research community.
- Out-of-scope use cases: NLLB-200 is a research model and is not released for production deployment. It is trained on general-domain text data and is not intended to be used with domain-specific texts, such as medical or legal texts. The model is not intended for document translation. Since the model was trained with input lengths not exceeding 512 tokens, translating longer sequences might result in quality degradation (see the sentence-splitting sketch below). NLLB-200 translations cannot be used as certified translations.
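Because the model targets single sentences and was trained with inputs of at most 512 tokens, longer text is best split into sentences before translation. A minimal sketch, assuming a naive regex splitter (a proper per-language sentence segmenter would be preferable) and the `translator`/`tokenizer` loaded as in the usage examples above:

```python
import re

def split_sentences(text):
    # Naive split on sentence-final punctuation; replace with a real segmenter for serious use.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

long_text = "Hello world! This is a longer paragraph. Each sentence is translated separately."
sentences = split_sentences(long_text)
encoded = [tokenizer.convert_ids_to_tokens(tokenizer.encode(s)) for s in sentences]
results = translator.translate_batch(encoded, target_prefix=[["fra_Latn"]] * len(sentences))
print(" ".join(tokenizer.decode(tokenizer.convert_tokens_to_ids(r.hypotheses[0][1:])) for r in results))
```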
### Metrics
The NLLB-200 model was evaluated using the BLEU, spBLEU, and chrF++ metrics, which are widely adopted by the machine translation community. Additionally, human evaluation was performed with the XSTS protocol, and the toxicity of the generated translations was measured.
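For reference, corpus-level BLEU and chrF++ scores on your own test sets can be computed with the `sacrebleu` package (an illustrative use of sacrebleu, not the exact evaluation pipeline from the paper; spBLEU additionally relies on the Flores SentencePiece tokenization):

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["Bonjour le monde !"]    # system outputs
references = [["Bonjour le monde !"]]  # one list per reference set, aligned with the hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 gives chrF++
print(bleu.score, chrf.score)
```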
### Evaluation Data
- Datasets: The Flores-200 dataset, described in Section 4 of the paper.
- Motivation: Flores-200 was used because it provides full evaluation coverage of the languages in NLLB-200.
- Preprocessing: Sentence-split raw text data was preprocessed using SentencePiece. The SentencePiece model is released along with NLLB-200 (see the tokenizer check below).
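Since the tokenizer distributed with this conversion is SentencePiece-based, the subword segmentation can be inspected directly (a small check using the `tokenizer` loaded in the usage examples; the exact pieces depend on the vocabulary):

```python
# Inspect the SentencePiece subword tokens produced for a sample sentence.
print(tokenizer.tokenize("Hello world!"))
# e.g. ['▁Hello', '▁world', '!']
```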
### Training Data
Parallel multilingual data from various sources were used to train the model. A detailed report on the data selection and construction process is provided in Section 5 of the paper. Monolingual data constructed from Common Crawl were also used, with more details in Section 5.2.
### Ethical Considerations
In this work, a reflexive approach was taken in technological development to prioritize human users and minimize risks transferred to them. Many languages chosen for this study are low-resource languages, especially African languages. While quality translation can improve education and information access in these communities, it may also make groups with lower digital literacy more vulnerable to misinformation or online scams if bad actors misappropriate the work. Regarding data acquisition, the training data were mined from various publicly available web sources. Although much effort was put into data cleaning, personally identifiable information may not be entirely eliminated. Finally, despite efforts to optimize translation quality, mistranslations may still occur, which could have adverse impacts on those relying on these translations for important decisions (especially related to health and safety).
### Caveats and Recommendations
The model has been tested on the Wikimedia domain with limited investigation on other domains supported in NLLB-MD. In addition, the supported languages may have variations that the model does not capture. Users should make appropriate assessments.
### Carbon Footprint Details
The carbon dioxide equivalent (CO2e) estimate is reported in Section 8.8 of the paper.
## 🔧 Technical Details
- Training algorithms, parameters, fairness constraints, and other applied approaches: The exact training algorithm, the data, and the strategies used to handle data imbalances for high- and low-resource languages when training NLLB-200 are described in the paper.
- Paper or other resource for more information: NLLB Team et al., No Language Left Behind: Scaling Human-Centered Machine Translation, arXiv, 2022.
## 📄 License
This model is released under the CC-BY-NC license.
## Available languages
You can find the list of available languages here: https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200 (a programmatic check of the accepted codes is shown after the list below).
ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab,
aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab,
asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl,
bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn,
bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn,
cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn,
dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn,
ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn,
fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr,
hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn,
hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn,
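The language codes are registered as additional special tokens on the NLLB tokenizer, so the accepted codes can also be checked programmatically (a small sketch using standard transformers tokenizer attributes and the `tokenizer` from the usage examples):

```python
# The NLLB tokenizer stores the language codes as additional special tokens.
print(len(tokenizer.additional_special_tokens))
print(tokenizer.additional_special_tokens[:10])
assert "fra_Latn" in tokenizer.additional_special_tokens
```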
## ⚠️ Important Note
NLLB-200 is a research model and is not released for production deployment. It is trained on general-domain text data and is not suitable for domain-specific texts. Translating sequences longer than 512 tokens may lead to quality degradation, and the translations cannot be used as certified translations.
## 💡 Usage Tip
When using the model, make sure to select appropriate source and target language codes from the available language list. Also, pay attention to the memory requirements: the full-precision 3.3B model needs roughly 13 GB of memory, and quantized loading (sketched below) can reduce this.
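If memory is tight, CTranslate2 can quantize the weights when the model is loaded via the `compute_type` argument; this is a standard CTranslate2 option, though the exact memory savings and quality impact for this particular model are not reported here:

```python
import ctranslate2

# Load with 8-bit quantized weights to reduce memory use; other values such as
# "int8_float16" are available on supported GPUs.
translator = ctranslate2.Translator("nllb-200-3.3B-ctranslate2", device="cpu", compute_type="int8")
```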

