🚀 IndicTrans2
IndicTrans2 is a multilingual machine translation model; this card covers the Indic-En 1.1B variant, which provides high-quality translation from a wide range of Indian languages into English.
Evaluation metrics for this particular checkpoint are reported in the preprint. For further details on model training, intended use, data, metrics, limitations, and recommendations, please refer to Appendix D: Model Card of the preprint.
✨ Features
- Multilingual Support: Supports a wide range of Indian languages: as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur.
- High-Quality Translation: Evaluated on benchmarks including flores-200, IN22-Gen, and IN22-Conv, using metrics such as bleu, chrf, chrf++, and comet (a minimal scoring sketch follows this list).
- Long Context Handling: New RoPE-based models can handle sequence lengths of up to 2048 tokens.
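For reference, the bleu and chrf++ scores above can be reproduced for your own outputs with the sacrebleu library. This is a minimal sketch, not part of the original card, with made-up hypothesis/reference strings for illustration; comet scoring requires the separate unbabel-comet package.

```python
import sacrebleu

# Hypothetical system outputs and references, for illustration only
hypotheses = ["When I was young, I used to go to the park every day."]
references = [["When I was young, I went to the park every day."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf_pp = sacrebleu.CHRF(word_order=2).corpus_score(hypotheses, references)  # word_order=2 gives chrF++
print(f"BLEU: {bleu.score:.2f}, chrF++: {chrf_pp.score:.2f}")
```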
📦 Installation
No specific installation steps are given in the upstream model card. The usage example below assumes PyTorch, 🤗 Transformers, and the IndicTransToolkit package (which provides `IndicProcessor`) are installed.
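A minimal setup sketch, not from the original card: the IndicTransToolkit install command below assumes its GitHub repository supports pip installation, and flash-attn is only needed for the `flash_attention_2` code path.

```bash
pip install torch transformers
# IndicTransToolkit provides the IndicProcessor used in the example below
pip install git+https://github.com/VarunGumma/IndicTransToolkit.git
# Optional: needed only for attn_implementation="flash_attention_2";
# see the flash-attn project for its build requirements
pip install flash-attn
```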
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

src_lang, tgt_lang = "hin_Deva", "eng_Latn"
model_name = "ai4bharat/indictrans2-indic-en-1B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,  # half precision; use torch.float32 on CPU
    attn_implementation="flash_attention_2",  # requires flash-attn; drop on unsupported hardware
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।",  # When I was young, I went to the park every day.
    "हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।",  # Last week we watched a new film that was very inspiring.
    "अगर तुम मुझे उस समय पास मिलते, तो हम बाहर खाना खाने चलते।",  # If you had met me then, we would have gone out to eat.
    "मेरे मित्र ने मुझे उसके जन्मदिन की पार्टी में बुलाया है, और मैं उसे एक तोहफा दूंगा।",  # My friend has invited me to his birthday party, and I will give him a gift.
]

# Add the language tags and normalize the inputs
batch = ip.preprocess_batch(
    input_sentences,
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)

# Tokenize the batch and move the tensors to the target device
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate the translations with beam search
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated token ids into text
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Restore entities and detokenize for the target language
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
```
Advanced Usage
- Long Context IT2 Models:
  - New RoPE-based IndicTrans2 models capable of handling sequence lengths of up to 2048 tokens are available here.
  - These models can be used by simply changing the `model_name` parameter (a loading sketch follows this list). Please read the model card of the RoPE-IT2 models for more information about generation.
  - It is recommended to run these models with `flash_attention_2` for efficient generation.
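A minimal loading sketch under those recommendations; the repository id below is illustrative, so check the RoPE-IT2 model cards for the actual names.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative id for a long-context RoPE variant: substitute the actual
# repository name from the RoPE-IT2 collection.
rope_model_name = "prajdabre/rotary-indictrans2-indic-en-1B"

tokenizer = AutoTokenizer.from_pretrained(rope_model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    rope_model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # recommended for these models
).to("cuda")
# The rest of the pipeline (IndicProcessor, tokenize, generate) is unchanged.
```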
📚 Documentation
Language Details
| Property | Details |
|----------|---------|
| Supported Languages | as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur |
| Language Details | asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab |
| Tags | indictrans2, translation, ai4bharat, multilingual |
| License | mit |
| Datasets | flores-200, IN22-Gen, IN22-Conv |
| Metrics | bleu, chrf, chrf++, comet |
| Inference | false |
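The `src_lang`/`tgt_lang` values in the usage example are the FLORES-style tags from the Language Details row (ISO code plus script). A minimal lookup sketch, covering only an illustrative subset of the supported languages:

```python
# Illustrative subset of ISO code -> FLORES-style tag; the full mapping
# follows the Supported Languages / Language Details rows above.
ISO_TO_TAG = {
    "as": "asm_Beng",
    "bn": "ben_Beng",
    "hi": "hin_Deva",
    "ta": "tam_Taml",
    "te": "tel_Telu",
    "en": "eng_Latn",
}

src_lang, tgt_lang = ISO_TO_TAG["hi"], ISO_TO_TAG["en"]  # "hin_Deva", "eng_Latn"
```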
📄 License
This project is licensed under the MIT license.
📖 Citation
If you use our work, please cite:
```bibtex
@article{gala2023indictrans,
  title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=vfT4YuzAYA},
  note={}
}
```