MADLAD-400-7B-MT Model Card
This model card describes MADLAD-400-7B-MT, a multilingual machine translation model, covering its capabilities, usage, training, and evaluation.
Quick Start
Using the PyTorch model with transformers
First, install the required Python packages:
pip install transformers accelerate sentencepiece
from transformers import T5ForConditionalGeneration, T5Tokenizer
model_name = 'jbochi/madlad400-7b-mt'
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
tokenizer = T5Tokenizer.from_pretrained(model_name)
text = "<2pt> I love pizza!"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids)
tokenizer.decode(outputs[0], skip_special_tokens=True)
Running the model with Candle
Usage with candle, Hugging Face's Rust ML framework:
$ cargo run --example t5 --release -- \
--model-id "jbochi/madlad400-7b-mt" \
--prompt "<2de> How are you, my friend?" \
--decode --temperature 0
Features
- Multilingual Support: Supports over 400 languages, making it suitable for a wide range of translation tasks.
- Based on T5 Architecture: Built on the T5 encoder-decoder architecture for sequence-to-sequence generation.
- Competitive Performance: Competes with significantly larger models in terms of translation quality.
Installation
Using transformers
pip install transformers accelerate sentencepiece
Usage Examples
Basic Usage
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = 'jbochi/madlad400-7b-mt'
# device_map="auto" places the weights on the available GPU(s), falling back to CPU
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Prepend the <2xx> target-language token; <2pt> requests Portuguese
text = "<2pt> I love pizza!"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
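The example above translates a single sentence with default (greedy) generation settings. Below is a minimal sketch of batched translation with an explicit cap on the number of generated tokens; the sentence list and the max_new_tokens value are illustrative choices, not settings from the model card.

# Sketch: translate several sentences in one batch; values are illustrative.
sentences = ["<2pt> I love pizza!", "<2de> I love pizza!"]
batch = tokenizer(sentences, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**batch, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))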
Advanced Usage: running the model with Candle
$ cargo run --example t5 --release -- \
--model-id "jbochi/madlad400-7b-mt" \
--prompt "<2de> How are you, my friend?" \
--decode --temperature 0
Documentation
Model Details
Uses
Direct Use and Downstream Use
Primary intended uses: Machine Translation and multilingual NLP tasks on over 400 languages.
Primary intended users: Research community.
Out-of-Scope Use
These models are trained on general-domain data and are therefore not intended to work on domain-specific
data out of the box. Moreover, these research models have not been assessed for production use cases.
Bias, Risks, and Limitations
We note that we evaluate these models on only 204 of the languages they support, and only on machine
translation and few-shot machine translation tasks. Users must carefully consider whether this model is
appropriate for their own use case.
Ethical considerations and risks
We trained these models with MADLAD-400 and publicly available data to create baseline models that
support NLP for over 400 languages, with a focus on languages underrepresented in large-scale corpora.
Given that these models were trained with web-crawled datasets that may contain sensitive, offensive, or
otherwise low-quality content despite extensive preprocessing, it is still possible that issues in the
underlying training data cause differences in model performance and toxic (or otherwise problematic)
output for certain domains. Moreover, large models are dual-use technologies that have specific risks
associated with their use and development. We point the reader to surveys such as those written by
Weidinger et al. or Bommasani et al. for a more detailed discussion of these risks, and to Liebling
et al. for a thorough discussion of the risks of machine translation systems.
Training Details
Training Data
For both the machine translation and language model, MADLAD-400 is used. For the machine translation
model, a combination of parallel data sources covering 157 languages is also used. Further details are
described in the paper.
Training Procedure
See the research paper for further details.
Evaluation
Testing Data, Factors & Metrics
For evaluation, we used the WMT, NTREX, Flores-200, and Gatones datasets, as described in Section 4.3 of the paper.
The translation quality of this model varies by language, as shown in the paper, and likely varies by
domain as well, though we have not assessed this.
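As a rough illustration of how translation quality can be scored against reference translations from such test sets, the sketch below uses the sacrebleu package (pip install sacrebleu) to compute corpus-level chrF and BLEU. This is not the exact evaluation pipeline from the paper, and the hypothesis/reference strings are placeholders.

# Sketch: score model outputs against references with sacrebleu (not the paper's pipeline).
import sacrebleu

hypotheses = ["Eu adoro pizza!"]      # model translations (placeholders)
references = [["Eu amo pizza!"]]      # one list per reference set, aligned with hypotheses
print(sacrebleu.corpus_chrf(hypotheses, references))
print(sacrebleu.corpus_bleu(hypotheses, references))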
Results
See the research paper for further details.
Technical Details
The model is trained on 250 billion tokens covering over 450 languages, using publicly available data. It shares all parameters across language pairs and uses a SentencePiece model with a 256k-token vocabulary shared between the encoder and decoder. Each input sentence is prefixed with a <2xx> token indicating the target language.
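For example, the same source sentence can be translated into different languages simply by changing the <2xx> prefix. The sketch below reuses the model and tokenizer loaded in the usage examples above; the language codes in the list are illustrative (pt and de appear in this card, fr is an assumed additional code), and max_new_tokens is an arbitrary cap.

# Sketch: the <2xx> prefix selects the target language; codes below are examples.
for lang in ["pt", "de", "fr"]:  # "fr" is an assumed code, not confirmed by this card
    prompt = f"<2{lang}> I love pizza!"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    outputs = model.generate(input_ids=input_ids, max_new_tokens=32)
    print(lang, tokenizer.decode(outputs[0], skip_special_tokens=True))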
License
This model is licensed under the Apache 2.0 license.
Citation
@misc{kudugunta2023madlad400,
title={MADLAD-400: A Multilingual And Document-Level Large Audited Dataset},
author={Sneha Kudugunta and Isaac Caswell and Biao Zhang and Xavier Garcia and Christopher A. Choquette-Choo and Katherine Lee and Derrick Xin and Aditya Kusupati and Romi Stella and Ankur Bapna and Orhan Firat},
year={2023},
eprint={2309.04662},
archivePrefix={arXiv},
primaryClass={cs.CL}
}