MADLAD-400-7B-MT Model Card
This model card describes MADLAD-400-7B-MT, a multilingual machine translation model, covering its capabilities, usage, training, and evaluation.
Quick Start
Using the PyTorch model with transformers
First, install the required Python packages:
pip install transformers accelerate sentencepiece
from transformers import T5ForConditionalGeneration, T5Tokenizer
model_name = 'jbochi/madlad400-7b-mt'
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
tokenizer = T5Tokenizer.from_pretrained(model_name)
text = "<2pt> I love pizza!"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids)
tokenizer.decode(outputs[0], skip_special_tokens=True)
Running the model with Candle
Usage with candle, Hugging Face's Rust ML framework:
$ cargo run --example t5 --release -- \
--model-id "jbochi/madlad400-7b-mt" \
--prompt "<2de> How are you, my friend?" \
--decode --temperature 0
Features
- Multilingual Support: Supports over 400 languages, making it suitable for a wide range of translation tasks.
- Based on T5 Architecture: Built on the T5 encoder-decoder architecture for sequence-to-sequence generation.
- Competitive Performance: Competes with significantly larger models in terms of translation quality.
Installation
Using transformers
pip install transformers accelerate sentencepiece
Usage Examples
Basic Usage
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = 'jbochi/madlad400-7b-mt'
# device_map="auto" places the weights on the available GPU(s), falling back to CPU
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Prepend the <2xx> target-language token; <2pt> requests Portuguese
text = "<2pt> I love pizza!"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
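The example above translates a single sentence with default (greedy) generation settings. Below is a minimal sketch of batched translation with an explicit cap on the number of generated tokens; the sentence list and the max_new_tokens value are illustrative choices, not settings from the model card.

# Sketch: translate several sentences in one batch; values are illustrative.
sentences = ["<2pt> I love pizza!", "<2de> I love pizza!"]
batch = tokenizer(sentences, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**batch, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))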
Advanced Usage: running the model with Candle
$ cargo run --example t5 --release -- \
--model-id "jbochi/madlad400-7b-mt" \
--prompt "<2de> How are you, my friend?" \
--decode --temperature 0
Documentation
Model Details
Uses
Direct Use and Downstream Use
Primary intended uses: Machine Translation and multilingual NLP tasks on over 400 languages.
Primary intended users: Research community.
Out-of-Scope Use
These models are trained on general-domain data and are therefore not intended to work on domain-specific
data out of the box. Moreover, these research models have not been assessed for production use cases.
Bias, Risks, and Limitations
We note that we evaluate these models on only 204 of the languages they support, and only on machine
translation and few-shot machine translation tasks. Users must carefully consider whether this model is
appropriate for their own use case.
Ethical considerations and risks
We trained these models with MADLAD-400 and publicly available data to create baseline models that
support NLP for over 400 languages, with a focus on languages underrepresented in large-scale corpora.
Given that these models were trained with web-crawled datasets that may contain sensitive, offensive, or
otherwise low-quality content despite extensive preprocessing, it is still possible that issues in the
underlying training data cause differences in model performance and toxic (or otherwise problematic)
output for certain domains. Moreover, large models are dual-use technologies that have specific risks
associated with their use and development. We point the reader to surveys such as those written by
Weidinger et al. or Bommasani et al. for a more detailed discussion of these risks, and to Liebling
et al. for a thorough discussion of the risks of machine translation systems.
Training Details
Training Data
For both the machine translation and language model, MADLAD-400 is used. For the machine translation
model, a combination of parallel data sources covering 157 languages is also used. Further details are
described in the paper.
Training Procedure
See the research paper for further details.
Evaluation
Testing Data, Factors & Metrics
For evaluation, we used the WMT, NTREX, Flores-200, and Gatones datasets, as described in Section 4.3 of the paper.
The translation quality of this model varies by language, as shown in the paper, and likely varies by
domain as well, though we have not assessed this.
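As a rough illustration of how translation quality can be scored against reference translations from such test sets, the sketch below uses the sacrebleu package (pip install sacrebleu) to compute corpus-level chrF and BLEU. This is not the exact evaluation pipeline from the paper, and the hypothesis/reference strings are placeholders.

# Sketch: score model outputs against references with sacrebleu (not the paper's pipeline).
import sacrebleu

hypotheses = ["Eu adoro pizza!"]      # model translations (placeholders)
references = [["Eu amo pizza!"]]      # one list per reference set, aligned with hypotheses
print(sacrebleu.corpus_chrf(hypotheses, references))
print(sacrebleu.corpus_bleu(hypotheses, references))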
Results
See the research paper for further details.
Technical Details
The model is trained on 250 billion tokens covering over 450 languages, using publicly available data. It shares all parameters across language pairs and uses a SentencePiece model with a 256k-token vocabulary shared between the encoder and decoder. Each input sentence is prefixed with a <2xx> token indicating the target language.
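For example, the same source sentence can be translated into different languages simply by changing the <2xx> prefix. The sketch below reuses the model and tokenizer loaded in the usage examples above; the language codes in the list are illustrative (pt and de appear in this card, fr is an assumed additional code), and max_new_tokens is an arbitrary cap.

# Sketch: the <2xx> prefix selects the target language; codes below are examples.
for lang in ["pt", "de", "fr"]:  # "fr" is an assumed code, not confirmed by this card
    prompt = f"<2{lang}> I love pizza!"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    outputs = model.generate(input_ids=input_ids, max_new_tokens=32)
    print(lang, tokenizer.decode(outputs[0], skip_special_tokens=True))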
License
This model is licensed under the Apache 2.0 license.
Citation
@misc{kudugunta2023madlad400,
title={MADLAD-400: A Multilingual And Document-Level Large Audited Dataset},
author={Sneha Kudugunta and Isaac Caswell and Biao Zhang and Xavier Garcia and Christopher A. Choquette-Choo and Katherine Lee and Derrick Xin and Aditya Kusupati and Romi Stella and Ankur Bapna and Orhan Firat},
year={2023},
eprint={2309.04662},
archivePrefix={arXiv},
primaryClass={cs.CL}
}