🚀 IndicTrans2
This is the model card of the IndicTrans2 Indic-En Distilled 200M variant. It translates from the supported Indian languages into English, contributing to high-quality and accessible machine translation.
🚀 Quick Start
Please refer to section 7.6: Distilled Models in the TMLR submission for further details on model training, data, and metrics.
📦 Installation
The original document provides no specific installation steps; the example below assumes torch, transformers, and IndicTransToolkit are installed (e.g., pip install torch transformers IndicTransToolkit).
💻 Usage Examples
Basic Usage
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor
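# Run on GPU when available; the FP16 + FlashAttention settings below assume a CUDA device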
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
src_lang, tgt_lang = "hin_Deva", "eng_Latn"
model_name = "ai4bharat/indictrans2-indic-en-dist-200M"
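# Load the tokenizer and model; trust_remote_code is required because IndicTrans2 ships custom modeling code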
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to(DEVICE)
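# IndicProcessor handles language tagging and pre/post-processing around the model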
ip = IndicProcessor(inference=True)
input_sentences = [
    "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।",
    "हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।",
    "अगर तुम मुझे उस समय पास मिलते, तो हम बाहर खाना खाने चलते।",
    "मेरे मित्र ने मुझे उसके जन्मदिन की पार्टी में बुलाया है, और मैं उसे एक तोहफा दूंगा।",
]
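# Add source/target language tags and normalize the input batch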
batch = ip.preprocess_batch(
    input_sentences,
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)
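# Tokenize the preprocessed batch and move the tensors to the model device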
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)
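# Generate translations with beam search; no gradients are needed at inference time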
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )
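# Decode the generated token ids back to text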
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)
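# Post-process the decoded text into the final translations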
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)
for input_sentence, translation in zip(input_sentences, translations):
print(f"{src_lang}: {input_sentence}")
print(f"{tgt_lang}: {translation}")
📚 Documentation
📢 Long Context IT2 Models
- New RoPE-based IndicTrans2 models, capable of handling sequence lengths up to 2048 tokens, are available here.
- These models can be used by simply changing the model_name parameter. Please read the model card of the RoPE-IT2 models for more information about generation; see the sketch after this list.
- It is recommended to run these models with flash_attention_2 for efficient generation.
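A hedged sketch of that drop-in swap, assuming the long-context checkpoint exposes the same interface as this model; the checkpoint id below is a placeholder, not an actual model name:

import torch
from transformers import AutoModelForSeq2SeqLM

# Placeholder id -- substitute the actual RoPE-IT2 checkpoint from the linked model card
rope_model_name = "<rope-it2-indic-en-dist-200M>"

model = AutoModelForSeq2SeqLM.from_pretrained(
    rope_model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    # Recommended above for efficient generation on long (up to 2048-token) inputs
    attn_implementation="flash_attention_2",
).to("cuda")

The rest of the quick-start pipeline (preprocessing, generation, postprocessing) is unchanged.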
📄 License
This project is licensed under the MIT license.
📚 Additional Information
Supported Languages
| Property | Details |
|----------|---------|
| Languages | as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur |
| Language Details | asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab |
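Any tag from the Language Details row can be passed as src_lang in the quick-start example. A small sketch, reusing the ip processor defined above; the input string is a placeholder, not sample data from this card:

# FLORES-style tags from the table above select the source language/script
src_lang, tgt_lang = "ben_Beng", "eng_Latn"  # e.g., Bengali -> English

batch = ip.preprocess_batch(
    ["<your Bengali sentence here>"],  # placeholder input
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)

Everything downstream (tokenization, generation, postprocessing) stays the same.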
Tags
- indictrans2
- translation
- ai4bharat
- multilingual
Datasets
- flores-200
- IN22-Gen
- IN22-Conv
Metrics
Inference
Hosted inference is set to false, i.e., the inference widget is disabled for this model.
📖 Citation
If you use our work, please cite:
@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}