IndicTrans2
This is the model card for the IndicTrans2 En-Indic Distilled 200M variant, a distilled 200M-parameter model that translates from English into the 22 scheduled Indian languages.
Features
- Multilingual Support: covers the language codes as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur.
- Translation Metrics: evaluated with BLEU, chrF, chrF++, and COMET (see the sketch after this list).
- Long Context Handling: newer RoPE-based models can handle sequence lengths of up to 2048 tokens.
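As a hedged illustration of the listed metrics (the sacrebleu library and the sentences below are assumptions for this sketch, not something the original card prescribes), BLEU and chrF++ can be computed like this:

```python
# Illustrative only: sacrebleu (`pip install sacrebleu`) and the strings
# below are placeholders, not part of the original model card.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["जब मैं छोटा था, मैं हर दिन पार्क जाता था।"]        # system outputs
references = [["जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।"]]      # one reference stream

bleu = BLEU()
chrf_pp = CHRF(word_order=2)  # word_order=2 corresponds to chrF++

print(bleu.corpus_score(hypotheses, references))
print(chrf_pp.corpus_score(hypotheses, references))
```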
Installation
The original model card does not include explicit installation steps. The usage example below imports torch, transformers, and IndicTransToolkit (which provides IndicProcessor).
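A minimal setup sketch, assuming these packages are installable from PyPI; the IndicTransToolkit package name is inferred from the import in the example, not from official install instructions:

```bash
# Assumed dependencies, inferred from the imports in the usage example below.
pip install torch transformers
pip install IndicTransToolkit   # provides IndicProcessor
```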
Usage Examples
Basic Usage
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
src_lang, tgt_lang = "eng_Latn", "hin_Deva"
model_name = "ai4bharat/indictrans2-en-indic-dist-200M"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# flash_attention_2 requires the flash-attn package and a CUDA GPU;
# remove the argument to fall back to the default attention implementation.
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "When I was young, I used to go to the park every day.",
    "We watched a new movie last week, which was very inspiring.",
    "If you had met me at that time, we would have gone out to eat.",
    "My friend has invited me to his birthday party, and I will give him a gift.",
]

# Add source/target language tags and normalize the input for the model.
batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)

inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations with beam search.
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Map decoded text back to clean target-language output.
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
Advanced Usage
New RoPE-based IndicTrans2 models, which can handle sequence lengths of up to 2048 tokens, are available here.
These models can be used by simply changing the model_name parameter. Running them with flash_attention_2 is recommended for efficient generation, as in the sketch below.
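A minimal sketch of that swap; the checkpoint name below is a placeholder, not a real model ID, so substitute the RoPE-based checkpoint linked from the original card:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder ID: replace with the actual RoPE-based long-context checkpoint.
model_name = "<rope-based-indictrans2-en-indic-checkpoint>"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # recommended for long sequences
)
# The rest of the pipeline (IndicProcessor, tokenization, generate,
# postprocessing) is unchanged from the Basic Usage example above.
```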
Documentation
Please refer to Section 7.6 (Distilled Models) in the TMLR submission for further details on model training, data, and metrics.
For a detailed description of how to use HF-compatible IndicTrans2 models for inference, please refer to the GitHub repository.
License
This project is licensed under the MIT license.
Model Information
| Property | Details |
|----------|---------|
| Supported Languages | as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur |
| Language Details | asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab |
| Tags | indictrans2, translation, ai4bharat, multilingual |
| Datasets | flores-200, IN22-Gen, IN22-Conv |
| Metrics | bleu, chrf, chrf++, comet |
Citation
If you use our work, please cite:
@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}