🚀 IndicTrans2
This is the model card of the IndicTrans2 Indic-En Distilled 200M variant. It translates from the supported Indian languages into English, contributing to high-quality and accessible machine translation.
🚀 Quick Start
Please refer to section 7.6: Distilled Models in the TMLR submission for further details on model training, data, and metrics.
📦 Installation
The original document provides no specific installation steps; the example below assumes torch, transformers, and IndicTransToolkit are installed (e.g., pip install torch transformers IndicTransToolkit).
💻 Usage Examples
Basic Usage
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor
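# Run on GPU when available; the FP16 + FlashAttention settings below assume a CUDA device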
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
src_lang, tgt_lang = "hin_Deva", "eng_Latn"
model_name = "ai4bharat/indictrans2-indic-en-dist-200M"
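# Load the tokenizer and model; trust_remote_code is required because IndicTrans2 ships custom modeling code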
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to(DEVICE)
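# IndicProcessor handles language tagging and pre/post-processing around the model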
ip = IndicProcessor(inference=True)
input_sentences = [
    "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।",
    "हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।",
    "अगर तुम मुझे उस समय पास मिलते, तो हम बाहर खाना खाने चलते।",
    "मेरे मित्र ने मुझे उसके जन्मदिन की पार्टी में बुलाया है, और मैं उसे एक तोहफा दूंगा।",
]
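# Add source/target language tags and normalize the input batch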
batch = ip.preprocess_batch(
    input_sentences,
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)
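# Tokenize the preprocessed batch and move the tensors to the model device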
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)
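# Generate translations with beam search; no gradients are needed at inference time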
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )
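# Decode the generated token ids back to text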
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)
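# Post-process the decoded text into the final translations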
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)
for input_sentence, translation in zip(input_sentences, translations):
print(f"{src_lang}: {input_sentence}")
print(f"{tgt_lang}: {translation}")
📚 Documentation
📢 Long Context IT2 Models
- New RoPE-based IndicTrans2 models, capable of handling sequence lengths up to 2048 tokens, are available here.
- These models can be used by simply changing the model_name parameter. Please read the model card of the RoPE-IT2 models for more information about generation; see the sketch after this list.
- It is recommended to run these models with flash_attention_2 for efficient generation.
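A hedged sketch of that drop-in swap, assuming the long-context checkpoint exposes the same interface as this model; the checkpoint id below is a placeholder, not an actual model name:

import torch
from transformers import AutoModelForSeq2SeqLM

# Placeholder id -- substitute the actual RoPE-IT2 checkpoint from the linked model card
rope_model_name = "<rope-it2-indic-en-dist-200M>"

model = AutoModelForSeq2SeqLM.from_pretrained(
    rope_model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    # Recommended above for efficient generation on long (up to 2048-token) inputs
    attn_implementation="flash_attention_2",
).to("cuda")

The rest of the quick-start pipeline (preprocessing, generation, postprocessing) is unchanged.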
📄 License
This project is licensed under the MIT license.
📚 Additional Information
Supported Languages
| Property | Details |
|----------|---------|
| Languages | as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur |
| Language Details | asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab |
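Any tag from the Language Details row can be passed as src_lang in the quick-start example. A small sketch, reusing the ip processor defined above; the input string is a placeholder, not sample data from this card:

# FLORES-style tags from the table above select the source language/script
src_lang, tgt_lang = "ben_Beng", "eng_Latn"  # e.g., Bengali -> English

batch = ip.preprocess_batch(
    ["<your Bengali sentence here>"],  # placeholder input
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)

Everything downstream (tokenization, generation, postprocessing) stays the same.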
Tags
- indictrans2
- translation
- ai4bharat
- multilingual
Datasets
- flores-200
- IN22-Gen
- IN22-Conv
Metrics
Inference
Hosted inference is set to false, i.e., the inference widget is disabled for this model.
📖 Citation
If you use our work, please cite:
@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}