NLLB-200
This is a machine translation model that supports translation among 200 languages and is especially useful for research on low-resource languages. It performs single-sentence translation across a wide range of languages, supporting the development of machine translation technology.
Quick Start
Download the Model
- Install Python from python.org.
- Open a command prompt (cmd) and check the Python version:
python --version
- Install the huggingface_hub library:
python -m pip install huggingface_hub
- Open the Python interpreter:
python
- Download the model snapshot:
import huggingface_hub
huggingface_hub.snapshot_download('entai2965/nllb-200-distilled-600M-ctranslate2', local_dir='nllb-200-distilled-600M-ctranslate2')
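Alternatively, recent versions of huggingface_hub install a command-line downloader that can fetch the same snapshot without opening the Python interpreter. This is a sketch; the exact flags can vary between huggingface_hub versions:
huggingface-cli download entai2965/nllb-200-distilled-600M-ctranslate2 --local-dir nllb-200-distilled-600M-ctranslate2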
Run the Model
- Refer to the CTranslate2 guide for background.
- Open a command prompt (cmd) and install the dependencies:
python -m pip install ctranslate2 transformers
- Open the Python interpreter:
python
import ctranslate2
import transformers
src_lang = "eng_Latn"
tgt_lang = "fra_Latn"
translator = ctranslate2.Translator("nllb-200-distilled-600M-ctranslate2", device="cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained("nllb-200-distilled-600M-ctranslate2", src_lang=src_lang, clean_up_tokenization_spaces=True)
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world!"))
target_prefix = [tgt_lang]
results = translator.translate_batch([source], target_prefix=[target_prefix])
target = results[0].hypotheses[0][1:]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
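For convenience, the steps above can be wrapped into a small helper. This is a minimal sketch based directly on the snippet above; the function name translate_sentence is illustrative, and it assumes the model folder downloaded earlier sits in the working directory. For repeated calls, load the translator and tokenizer once outside the function instead of per call:
import ctranslate2
import transformers

def translate_sentence(text, src_lang="eng_Latn", tgt_lang="fra_Latn",
                       model_dir="nllb-200-distilled-600M-ctranslate2", device="cpu"):
    # Load the CTranslate2 model and the matching tokenizer.
    translator = ctranslate2.Translator(model_dir, device=device)
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_dir, src_lang=src_lang, clean_up_tokenization_spaces=True)
    # Tokenize, translate with the target language code as a prefix,
    # then drop that prefix token before decoding.
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    results = translator.translate_batch([source], target_prefix=[[tgt_lang]])
    target = results[0].hypotheses[0][1:]
    return tokenizer.decode(tokenizer.convert_tokens_to_ids(target))

print(translate_sentence("Hello world!"))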
Run the Model in Batch Syntax
import os
import ctranslate2
import transformers
#set defaults
home_path = os.path.expanduser('~')
model_folder = home_path + '/Downloads/models/nllb-200-distilled-600M-ctranslate2' # 3 GB of memory
#model_folder = home_path + '/Downloads/models/nllb-200-distilled-1.3B-ctranslate2' # 5.5 GB of memory
#model_folder = home_path + '/Downloads/models/nllb-200-3.3B-ctranslate2-float16' # 13 GB of memory in almost all cases, 7.6 GB on CUDA + GeForce RTX 2000 series and newer
#model_folder = home_path + '/Downloads/models/nllb-200-3.3B-ctranslate2' # 13 GB of memory
string1 = 'Hello world!'
string2 = 'Awesome.'
raw_list = [string1, string2]
#https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200
source_language_code = "eng_Latn"
target_language_code = "fra_Latn"
device = 'cpu'
#device = 'cuda'
#load models
translator = ctranslate2.Translator(model_folder, device=device)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_folder, src_lang=source_language_code, clean_up_tokenization_spaces=True)
#tokenize input
encoded_list = []
for text in raw_list:
    encoded_list.append(tokenizer.convert_ids_to_tokens(tokenizer.encode(text)))
#translate
#https://opennmt.net/CTranslate2/python/ctranslate2.Translator.html?#ctranslate2.Translator.translate_batch
translated_list = translator.translate_batch(encoded_list, target_prefix=[[target_language_code]] * len(raw_list))
assert(len(raw_list) == len(translated_list))
#decode
for counter, tokens in enumerate(translated_list):
    translated_list[counter] = tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens.hypotheses[0][1:]))
#output
for text in translated_list:
    print(text)
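translate_batch also accepts decoding options documented in the CTranslate2 API linked above. The call below is a sketch of an alternative invocation of the translation step from the script above; the values 4 and 32 are arbitrary examples:
translated_list = translator.translate_batch(
    encoded_list,
    target_prefix=[[target_language_code]] * len(raw_list),
    beam_size=4,        # wider beam: slower, but can improve translation quality
    max_batch_size=32,  # process long input lists in chunks to bound memory use
)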
Functional Programming Version
import os
import ctranslate2
import transformers
#set defaults
home_path = os.path.expanduser('~')
model_folder = home_path + '/Downloads/models/nllb-200-distilled-600M-ctranslate2' # 3 GB of memory
#model_folder = home_path + '/Downloads/models/nllb-200-distilled-1.3B-ctranslate2' # 5.5 GB of memory
#model_folder = home_path + '/Downloads/models/nllb-200-3.3B-ctranslate2-float16' # 13 GB of memory in almost all cases, 7.6 GB on CUDA + GeForce RTX 2000 series and newer
#model_folder = home_path + '/Downloads/models/nllb-200-3.3B-ctranslate2' # 13 GB of memory
string1 = 'Hello world!'
string2 = 'Awesome.'
raw_list = [string1, string2]
#https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200
source_language_code = "eng_Latn"
target_language_code = "fra_Latn"
device = 'cpu'
#device = 'cuda'
#load models
translator = ctranslate2.Translator(model_folder, device=device)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_folder, src_lang=source_language_code, clean_up_tokenization_spaces=True)
#tokenize, translate, and decode in a single expression
translated_list = [tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens.hypotheses[0][1:])) for tokens in translator.translate_batch([tokenizer.convert_ids_to_tokens(tokenizer.encode(text)) for text in raw_list], target_prefix=[[target_language_code]] * len(raw_list))]
assert(len(raw_list) == len(translated_list))
#output
for text in translated_list:
    print(text)
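If memory is tight, ctranslate2.Translator also accepts a compute_type argument that loads the weights in reduced precision. This is a sketch of that option; whether int8 or float16 is actually available depends on the device and the converted model:
translator = ctranslate2.Translator(model_folder, device=device, compute_type="int8")
# Other documented values include "float16" (GPU) and "auto"; the rest of the
# tokenize / translate_batch / decode flow is unchanged.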
Features
- Supports single-sentence translation among 200 languages, facilitating research in machine translation, especially for low-resource languages.
Documentation
Intended Use
- Primary intended uses: NLLB-200 is a machine translation model mainly for research in machine translation, especially for low-resource languages. It enables single-sentence translation among 200 languages. Usage information can be found in the Fairseq code repository, along with training code and references to evaluation and training data.
- Primary intended users: Researchers and the machine translation research community.
- Out-of-scope use cases: NLLB-200 is a research model and is not intended for production deployment. It was trained on general-domain text, so it is not suitable for domain-specific texts (e.g., medical or legal documents) or for document-level translation. The maximum input length is 512 tokens, and translating longer sequences may degrade quality (a sketch for splitting longer text into sentences follows this list). Its translations cannot be used as certified translations.
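Because the model targets single-sentence inputs with a 512-token maximum, longer documents are usually split into sentences before translation. The sketch below uses a deliberately naive regex splitter for illustration only; a proper language-aware sentence splitter is preferable in practice:
import re

def split_into_sentences(text):
    # Naive splitter: break on ., ! or ? followed by whitespace. Illustration only.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

document = "NLLB-200 translates one sentence at a time. Split longer documents first. Then translate each sentence."
sentences = split_into_sentences(document)
# 'sentences' can now be tokenized and passed to translator.translate_batch()
# exactly like raw_list in the batch example above.
print(sentences)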
Metrics
- Model performance measures: The NLLB-200 model was evaluated using the BLEU, spBLEU, and chrF++ metrics widely adopted in the machine translation community. In addition, human evaluation was performed with the XSTS protocol, and the toxicity of generated translations was measured (a brief sketch for computing the automatic metrics follows).
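As a point of reference, corpus-level BLEU and chrF++ can be computed with the sacrebleu library. The snippet below is a sketch with made-up strings, not an attempt to reproduce the published numbers:
import sacrebleu  # pip install sacrebleu

hypotheses = ["Bonjour le monde !", "Génial."]          # system outputs (illustrative)
references = [["Bonjour le monde !", "Formidable."]]    # one reference stream, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 gives chrF++
print(round(bleu.score, 1), round(chrf.score, 1))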
Evaluation Data
- Datasets: The Flores-200 dataset, described in Section 4, was used.
- Motivation: It provides full evaluation coverage of the languages in NLLB-200.
- Preprocessing: Sentence-split raw text data was preprocessed using SentencePiece, and the SentencePiece model was released with NLLB-200 (an illustrative snippet follows this list).
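To illustrate the preprocessing step, a SentencePiece model can be loaded and applied with the sentencepiece package. The file path below is an assumption for demonstration, not the exact artifact shipped with NLLB-200:
import sentencepiece as spm  # pip install sentencepiece

# Hypothetical path to the released SentencePiece model; adjust to the actual file name.
sp = spm.SentencePieceProcessor(model_file="path/to/nllb200_sentencepiece.model")
print(sp.encode("Hello world!", out_type=str))  # subword pieces used during preprocessing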
Training Data
Parallel multilingual data from various sources was used to train the model. Details on data selection and construction are provided in Section 5 of the paper. Monolingual data constructed from Common Crawl was also used, with more details in Section 5.2.
Ethical Considerations
In this work, a reflexive approach was taken in technological development to prioritize human users and minimize risks. Many languages in the study are low-resource languages, especially African languages. While quality translation can improve education and information access, it may also make less digitally literate groups more vulnerable to misinformation or online scams. Training data was mined from public web sources, and although data cleaning was extensive, personally identifiable information may not be completely eliminated. Mistranslations, though rare, could have adverse impacts on decision-making related to health and safety.
Caveats and Recommendations
The model has been tested on the Wikimedia domain with limited investigation on other NLLB-MD supported domains. Supported languages may have variations not captured by the model, so users should make appropriate assessments.
Carbon Footprint Details
The carbon dioxide equivalent (CO2e) estimate is reported in Section 8.8 of the paper.
Technical Details
- Information about training algorithms, parameters, fairness constraints, and other applied approaches is described in the paper: NLLB Team et al., No Language Left Behind: Scaling Human-Centered Machine Translation, arXiv, 2022.
License
The license for this model is CC-BY-NC.
Available languages
Refer to the FLORES-200 README for the full list of language codes: https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200. A partial list is reproduced below, and the snippet after it shows one way to print the codes accepted by the local checkpoint.
ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab,
aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab,
asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl,
bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn,
bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn,
cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn,
dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn,
ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn,
fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr,
hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn,
hye_Armn
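One way to check which codes the local checkpoint accepts is to inspect the tokenizer, since the NLLB tokenizer registers the FLORES-200 language codes as additional special tokens. This is a sketch that assumes the converted folder keeps the original tokenizer files:
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("nllb-200-distilled-600M-ctranslate2")
print(sorted(tokenizer.additional_special_tokens))  # expected to include codes like eng_Latn, fra_Latn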
From: https://huggingface.co/facebook/nllb-200-distilled-600M
Metrics for this particular checkpoint are reported on the original model page linked above.
For questions or comments about the model, please visit here.

