IndicTrans2
This is the model card for the IndicTrans2 En-Indic Distilled 200M variant, a distilled 200M-parameter model that translates from English into the 22 scheduled Indian languages.
Features
- Multilingual Support: covers the language codes as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur.
- Translation Metrics: evaluated with BLEU, chrF, chrF++, and COMET (see the sketch after this list).
- Long Context Handling: newer RoPE-based models can handle sequence lengths of up to 2048 tokens.
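As a hedged illustration of the listed metrics (the sacrebleu library and the sentences below are assumptions for this sketch, not something the original card prescribes), BLEU and chrF++ can be computed like this:

```python
# Illustrative only: sacrebleu (`pip install sacrebleu`) and the strings
# below are placeholders, not part of the original model card.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["जब मैं छोटा था, मैं हर दिन पार्क जाता था।"]        # system outputs
references = [["जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।"]]      # one reference stream

bleu = BLEU()
chrf_pp = CHRF(word_order=2)  # word_order=2 corresponds to chrF++

print(bleu.corpus_score(hypotheses, references))
print(chrf_pp.corpus_score(hypotheses, references))
```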
Installation
The original model card does not include explicit installation steps. The usage example below imports torch, transformers, and IndicTransToolkit (which provides IndicProcessor).
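A minimal setup sketch, assuming these packages are installable from PyPI; the IndicTransToolkit package name is inferred from the import in the example, not from official install instructions:

```bash
# Assumed dependencies, inferred from the imports in the usage example below.
pip install torch transformers
pip install IndicTransToolkit   # provides IndicProcessor
```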
Usage Examples
Basic Usage
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
src_lang, tgt_lang = "eng_Latn", "hin_Deva"
model_name = "ai4bharat/indictrans2-en-indic-dist-200M"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# flash_attention_2 requires the flash-attn package and a CUDA GPU;
# remove the argument to fall back to the default attention implementation.
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "When I was young, I used to go to the park every day.",
    "We watched a new movie last week, which was very inspiring.",
    "If you had met me at that time, we would have gone out to eat.",
    "My friend has invited me to his birthday party, and I will give him a gift.",
]

# Add source/target language tags and normalize the input for the model.
batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)

inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations with beam search.
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Map decoded text back to clean target-language output.
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
Advanced Usage
New RoPE-based IndicTrans2 models, which can handle sequence lengths of up to 2048 tokens, are available here.
These models can be used by simply changing the model_name parameter. Running them with flash_attention_2 is recommended for efficient generation, as in the sketch below.
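A minimal sketch of that swap; the checkpoint name below is a placeholder, not a real model ID, so substitute the RoPE-based checkpoint linked from the original card:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder ID: replace with the actual RoPE-based long-context checkpoint.
model_name = "<rope-based-indictrans2-en-indic-checkpoint>"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # recommended for long sequences
)
# The rest of the pipeline (IndicProcessor, tokenization, generate,
# postprocessing) is unchanged from the Basic Usage example above.
```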
Documentation
Please refer to Section 7.6 (Distilled Models) in the TMLR submission for further details on model training, data, and metrics.
For a detailed description of how to use HF-compatible IndicTrans2 models for inference, please refer to the GitHub repository.
License
This project is licensed under the MIT license.
Model Information
| Property | Details |
|----------|---------|
| Supported Languages | as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur |
| Language Details | asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab |
| Tags | indictrans2, translation, ai4bharat, multilingual |
| Datasets | flores-200, IN22-Gen, IN22-Conv |
| Metrics | bleu, chrf, chrf++, comet |
Citation
If you use our work, please cite:
@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}