madlad400-3b-mt: Open-source Multilingual Processing Model, Free Deployment, Supports over 400 Languages for NLP Tasks

Madlad400 3b Mt

Developed by Google

A multilingual processing model supporting over 400 languages, suitable for machine translation and other natural language processing tasks.

Large Language Model | Supports Multiple Languages | Open Source License: Apache-2.0

#Extensive multilingual support #Cross-language understanding #Global applications

Downloads 7,035

Release Time: 11/27/2023

Model Overview

This is a multilingual natural language processing model with broad language coverage. It is intended primarily for machine translation and is also described as applicable to tasks such as text classification and question answering.

Model Features

Extensive language support

Supports over 400 languages, including many niche and low-resource languages

Open-source license

Released under the Apache 2.0 license, which permits commercial use and modification

Versatility

Applicable to various natural language processing tasks

Model Capabilities

Text classification

Machine translation

Question answering systems

Text generation

Language understanding

Use Cases

Cross-language applications

Multilingual customer service system

Build an automated customer service system supporting multiple languages

Can serve users in different languages simultaneously

Content localization

Automatically translate and adapt content for multiple languages and cultures

Enhances accessibility for global users

Language research

Low-resource language processing

Conduct NLP research on niche and low-resource languages

Promotes preservation of linguistic diversity

🚀 MADLAD-400-3B-MT Model Card

This is a multilingual machine translation model based on the T5 architecture. It was trained on 1 trillion tokens covering over 450 languages using publicly available data, and it can compete with significantly larger models.

🚀 Quick Start

Using the PyTorch model with `transformers`

Running the model on a CPU or GPU

First, install the required Python packages:

pip install transformers accelerate sentencepiece

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the model and tokenizer; device_map="auto" lets accelerate choose the device
model_name = 'jbochi/madlad400-3b-mt'
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
tokenizer = T5Tokenizer.from_pretrained(model_name)

# The <2pt> prefix selects the target language (here Portuguese)
text = "<2pt> I love pizza!"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids)

# Print the result so the translation is also shown when run as a script
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Eu adoro pizza!
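
The snippet above translates a single sentence with the default generation settings. As a minimal sketch (not part of the original model card), the hypothetical helper `translate_batch` below handles several sentences at once and passes an explicit `max_new_tokens`, since `generate` otherwise falls back to a default length limit that may cut longer translations short. The helper name, the 128-token default, and the assumption that `model` and `tokenizer` were loaded as shown above are all illustrative.

```python
def translate_batch(texts, target_lang, model, tokenizer, max_new_tokens=128):
    """Translate a list of sentences into one target language.

    Hypothetical helper: `model` and `tokenizer` are assumed to be the objects
    loaded in the quick-start snippet; 128 is an arbitrary illustrative cap.
    """
    # Prepend the <2xx> target-language token to every source sentence.
    prompts = [f"<2{target_lang}> {text}" for text in texts]
    # Pad so sentences of different lengths fit into one tensor batch.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    inputs = inputs.to(model.device)
    # Set the output length cap explicitly instead of relying on the default.
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Calling `translate_batch(["I love pizza!", "How are you, my friend?"], "pt", model, tokenizer)` would return one translated string per input sentence, in the same order.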

Running the model with Candle

Usage with candle:

cargo run --example t5 --release -- \
  --model-id "jbochi/madlad400-3b-mt" \
  --prompt "<2de> How are you, my friend?" \
  --decode --temperature 0

We also provide a quantized model (1.65 GB vs the original 11.8 GB file):

cargo run --example quantized-t5 --release  -- \
  --model-id "jbochi/madlad400-3b-mt" --weight-file "model-q4k.gguf" \
  --prompt "<2de> How are you, my friend?" \
  --temperature 0
...
 Wie geht es dir, mein Freund?
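
To put the two file sizes quoted above in perspective, the short calculation below derives the compression ratio. It uses only the 1.65 GB and 11.8 GB figures from the text; the variable names are illustrative.

```python
original_gb = 11.8   # size of the original weight file, as quoted above
quantized_gb = 1.65  # size of the quantized weight file, as quoted above

ratio = original_gb / quantized_gb        # how many times smaller the file is
saving = 1 - quantized_gb / original_gb   # fraction of disk space saved

print(f"{ratio:.1f}x smaller, {saving:.0%} less disk space")
# 7.2x smaller, 86% less disk space
```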

✨ Features

Multilingual Support: Supports over 400 languages, enabling a wide range of multilingual NLP tasks.
Based on T5 Architecture: Leverages the power of the T5 architecture for effective language processing.
Competitive Performance: Competes with significantly larger models despite its size.

📦 Installation

Using `pip`

pip install transformers accelerate sentencepiece

💻 Usage Examples

Basic Usage

from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = 'jbochi/madlad400-3b-mt'
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
tokenizer = T5Tokenizer.from_pretrained(model_name)

text = "<2pt> I love pizza!"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Eu adoro pizza!

Model Details

Model Description

Property	Details
Model Type	Language model
Language(s) (NLP)	Multilingual (400+ languages)
License	Apache 2.0
Related Models	All MADLAD-400 Checkpoints
Original Checkpoints	All Original MADLAD-400 Checkpoints
Resources for more information	Research paper; GitHub Repo; Hugging Face MADLAD-400 Docs (similar to T5), pending PR

Uses

Direct Use and Downstream Use

Primary intended uses: Machine Translation and multilingual NLP tasks on over 400 languages. Primary intended users: Research community.

Out-of-Scope Use

These models are trained on general-domain data and are therefore not meant to work in domain-specific settings out of the box. Moreover, these research models have not been assessed for production use cases.

Bias, Risks, and Limitations

We note that we evaluated on only 204 of the languages supported by these models, and only on machine translation and few-shot machine translation tasks. Users must carefully consider the use of this model for their own use case.

Ethical considerations and risks

We trained these models with MADLAD-400 and publicly available data to create baseline models that support NLP for over 400 languages, with a focus on languages underrepresented in large-scale corpora. Because these models were trained on web-crawled datasets that may contain sensitive, offensive, or otherwise low-quality content despite extensive preprocessing, issues in the underlying training data may still cause differences in model performance and toxic (or otherwise problematic) output for certain domains. Moreover, large models are dual-use technologies that have specific risks associated with their use and development. We point the reader to surveys such as those written by Weidinger et al. or Bommasani et al. for a more detailed discussion of these risks, and to Liebling et al. for a thorough discussion of the risks of machine translation systems.

Training Details

We train models of various sizes: a 3B-parameter, 32-layer model, a 7.2B-parameter, 48-layer model, and a 10.7B-parameter, 32-layer model. We share all parameters of the model across language pairs and use a SentencePiece model with 256k tokens shared between the encoder and the decoder. A <2xx> token is prepended to each source sentence to indicate the target language.
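
The target-language convention described above can be sketched in a few lines. The helper names `add_target_tag` and `target_codes_in_vocab` are hypothetical, and the second function assumes the <2xx> tags appear as ordinary entries in the tokenizer vocabulary (for example the dictionary returned by `tokenizer.get_vocab()`), which the text implies but does not state explicitly.

```python
import re

# Matches tokens of the form <2xx>, capturing the language code xx.
_TAG_PATTERN = re.compile(r"^<2([^<>\s]+)>$")


def add_target_tag(text, lang_code):
    """Prepend the <2xx> target-language token to a source sentence."""
    return f"<2{lang_code}> {text.strip()}"


def target_codes_in_vocab(vocab):
    """Return the sorted language codes whose <2xx> tag is a key of `vocab`.

    Assumption: `vocab` maps token strings to ids, as tokenizer.get_vocab() does.
    """
    codes = []
    for token in vocab:
        match = _TAG_PATTERN.match(token)
        if match:
            codes.append(match.group(1))
    return sorted(codes)


# Toy vocabulary standing in for the real 256k-entry one.
toy_vocab = {"<2pt>": 10, "<2de>": 11, "pizza": 12, "</s>": 1}
print(add_target_tag("I love pizza!", "pt"))  # <2pt> I love pizza!
print(target_codes_in_vocab(toy_vocab))       # ['de', 'pt']
```

Checking a requested code against the vocabulary before building the prompt is one way to catch a mistyped language tag early, rather than letting the model receive an unknown prefix.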

See the research paper for further details.

Training Data

MADLAD-400 is used for both the machine translation models and the language model. For the machine translation models, a combination of parallel data sources covering 157 languages is also used. Further details are described in the paper.

Training Procedure

See the research paper for further details.

Evaluation

Testing Data, Factors & Metrics

For evaluation, we used the WMT, NTREX, Flores-200 and Gatones datasets, as described in Section 4.3 of the paper.

The translation quality of this model varies by language, as shown in the paper, and likely varies by domain as well, though we have not assessed this.

Results

See the research paper for further details.

Citation

BibTeX:

@misc{kudugunta2023madlad400,
      title={MADLAD-400: A Multilingual And Document-Level Large Audited Dataset}, 
      author={Sneha Kudugunta and Isaac Caswell and Biao Zhang and Xavier Garcia and Christopher A. Choquette-Choo and Katherine Lee and Derrick Xin and Aditya Kusupati and Romi Stella and Ankur Bapna and Orhan Firat},
      year={2023},
      eprint={2309.04662},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
