🚀 IndicTrans2
This is the model card of IndicTrans2, an Indic-Indic 1B variant adapted by stitching the Indic-En 1B and En-Indic 1B variants. It offers translation across multiple Indic languages, contributing to high-quality and accessible machine translation for all 22 scheduled Indian languages.
✨ Features
- Multilingual Support: Supports a wide range of Indic languages, including `as`, `bn`, `brx`, and more.
- High-Quality Translation: Trained on datasets like `flores-200`, `IN22-Gen`, and `IN22-Conv`, and evaluated using metrics such as `bleu`, `chrf`, `chrf++`, and `comet`.
- AI4Bharat Initiative: Part of the `ai4bharat` project, which aims to promote AI for the Indian subcontinent.
Language Details
| Property | Details |
|---|---|
| Languages | as, bn, brx, doi, gom, gu, hi, kn, ks, mai, ml, mr, mni, ne, or, pa, sa, sat, snd, ta, te, ur |
| Language Codes | asm_Beng, ben_Beng, brx_Deva, doi_Deva, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, mai_Deva, mal_Mlym, mar_Deva, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Deva, tam_Taml, tel_Telu, urd_Arab |
Tags
- indictrans2
- translation
- ai4bharat
- multilingual
Datasets
- flores-200
- IN22-Gen
- IN22-Conv
Metrics
- bleu
- chrf
- chrf++
- comet
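The card itself does not specify an evaluation harness. As a minimal sketch of how BLEU and chrF++ scores can be computed, assuming the `sacrebleu` package (which the original card does not mention):

```python
# Minimal evaluation sketch. Assumption: the sacrebleu package is installed;
# the card does not say which tooling produced its reported scores.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["This is a sample model output."]  # system translations
references = [["This is a sample reference."]]   # one list per reference set

bleu = BLEU()
chrf_pp = CHRF(word_order=2)  # word_order=2 turns chrF into chrF++

print(bleu.corpus_score(hypotheses, references))
print(chrf_pp.corpus_score(hypotheses, references))
```

`comet` is a neural metric that requires downloading a separate scoring checkpoint, so it is omitted from this sketch.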
📦 Installation
The original card does not list installation steps. The usage example below depends on `torch`, `transformers`, and the `IndicTransToolkit` preprocessing package.
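A typical setup might look like the following (an assumption based on the imports in the usage example, not taken from the original card):

```bash
# Assumed setup: PyTorch, Hugging Face Transformers, and IndicTransToolkit.
# Exact package names are assumptions; check each project's own instructions.
pip install torch transformers
pip install IndicTransToolkit
```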
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

src_lang, tgt_lang = "hin_Deva", "tam_Taml"
model_name = "ai4bharat/indictrans2-indic-indic-1B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,  # use torch.float32 when running on CPU
    attn_implementation="flash_attention_2",  # optional; drop if flash-attn is not installed
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।",
    "हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।",
    "अगर तुम मुझे उस समय पास मिलते, तो हम बाहर खाना खाने चलते।",
    "मेरे मित्र ने मुझे उसके जन्मदिन की पार्टी में बुलाया है, और मैं उसे एक तोहफा दूंगा।",
]

# Normalize the batch and tag it with source/target language tokens.
batch = ip.preprocess_batch(
    input_sentences,
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)

# Tokenize the preprocessed sentences and move the tensors to the device.
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations with beam search.
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated token IDs back into text.
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Restore script-specific formatting for the target language.
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
```
Advanced Usage
The original card stops at the basic example. Because the Indic-Indic variant translates between any pair of the 22 supported languages, one natural extension is to reuse the already-loaded model for several target languages, as sketched below.
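A minimal sketch, assuming the objects from the basic example (`model`, `tokenizer`, `ip`, `input_sentences`, `DEVICE`) are still in scope; the helper function and the choice of target languages are illustrative, not an official recipe:

```python
# Illustrative helper (an assumption, not from the original card): wraps the
# preprocess -> tokenize -> generate -> decode -> postprocess pipeline above.
def translate_batch(sentences, src_lang, tgt_lang):
    batch = ip.preprocess_batch(sentences, src_lang=src_lang, tgt_lang=tgt_lang)
    inputs = tokenizer(
        batch,
        truncation=True,
        padding="longest",
        return_tensors="pt",
        return_attention_mask=True,
    ).to(DEVICE)
    with torch.no_grad():
        generated = model.generate(
            **inputs,
            use_cache=True,
            min_length=0,
            max_length=256,
            num_beams=5,
            num_return_sequences=1,
        )
    decoded = tokenizer.batch_decode(
        generated,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return ip.postprocess_batch(decoded, lang=tgt_lang)

# Translate the same Hindi batch into Tamil, Telugu, and Bengali.
for tgt in ["tam_Taml", "tel_Telu", "ben_Beng"]:
    for translation in translate_batch(input_sentences, "hin_Deva", tgt):
        print(f"{tgt}: {translation}")
```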
📚 Documentation
Please refer to the blog for further details on model training, data, and metrics. For a detailed description of how to use HF-compatible IndicTrans2 models for inference, please refer to the GitHub repository.
📄 License
This project is licensed under the MIT license.
📚 Citation
If you use our work, please cite:
```bibtex
@article{gala2023indictrans,
  title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=vfT4YuzAYA},
  note={}
}
```