🚀 IndicTrans2
IndicTrans2 is a multilingual machine translation model; this card covers the Indic-En 1.1B variant, which provides high-quality translation from a wide range of Indian languages into English.
Evaluation metrics for this particular checkpoint are reported in the preprint. For further details on model training, intended use, data, metrics, limitations, and recommendations, please refer to Appendix D: Model Card of the preprint.
✨ Features
- Multilingual Support: Supports a wide range of Indian languages: as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur.
- High-Quality Translation: Evaluated on benchmarks including flores-200, IN22-Gen, and IN22-Conv, using metrics such as bleu, chrf, chrf++, and comet (a minimal scoring sketch follows this list).
- Long Context Handling: New RoPE-based models can handle sequence lengths of up to 2048 tokens.
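For reference, the bleu and chrf++ scores above can be reproduced for your own outputs with the sacrebleu library. This is a minimal sketch, not part of the original card, with made-up hypothesis/reference strings for illustration; comet scoring requires the separate unbabel-comet package.

```python
import sacrebleu

# Hypothetical system outputs and references, for illustration only
hypotheses = ["When I was young, I used to go to the park every day."]
references = [["When I was young, I went to the park every day."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf_pp = sacrebleu.CHRF(word_order=2).corpus_score(hypotheses, references)  # word_order=2 gives chrF++
print(f"BLEU: {bleu.score:.2f}, chrF++: {chrf_pp.score:.2f}")
```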
📦 Installation
No specific installation steps are given in the upstream model card. The usage example below assumes PyTorch, 🤗 Transformers, and the IndicTransToolkit package (which provides `IndicProcessor`) are installed.
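A minimal setup sketch, not from the original card: the IndicTransToolkit install command below assumes its GitHub repository supports pip installation, and flash-attn is only needed for the `flash_attention_2` code path.

```bash
pip install torch transformers
# IndicTransToolkit provides the IndicProcessor used in the example below
pip install git+https://github.com/VarunGumma/IndicTransToolkit.git
# Optional: needed only for attn_implementation="flash_attention_2";
# see the flash-attn project for its build requirements
pip install flash-attn
```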
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

src_lang, tgt_lang = "hin_Deva", "eng_Latn"
model_name = "ai4bharat/indictrans2-indic-en-1B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,  # half precision; use torch.float32 on CPU
    attn_implementation="flash_attention_2",  # requires flash-attn; drop on unsupported hardware
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।",  # When I was young, I went to the park every day.
    "हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।",  # Last week we watched a new film that was very inspiring.
    "अगर तुम मुझे उस समय पास मिलते, तो हम बाहर खाना खाने चलते।",  # If you had met me then, we would have gone out to eat.
    "मेरे मित्र ने मुझे उसके जन्मदिन की पार्टी में बुलाया है, और मैं उसे एक तोहफा दूंगा।",  # My friend has invited me to his birthday party, and I will give him a gift.
]

# Add the language tags and normalize the inputs
batch = ip.preprocess_batch(
    input_sentences,
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)

# Tokenize the batch and move the tensors to the target device
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate the translations with beam search
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated token ids into text
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Restore entities and detokenize for the target language
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
```
Advanced Usage
- Long Context IT2 Models:
  - New RoPE-based IndicTrans2 models capable of handling sequence lengths of up to 2048 tokens are available here.
  - These models can be used by simply changing the `model_name` parameter (a loading sketch follows this list). Please read the model card of the RoPE-IT2 models for more information about generation.
  - It is recommended to run these models with `flash_attention_2` for efficient generation.
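A minimal loading sketch under those recommendations; the repository id below is illustrative, so check the RoPE-IT2 model cards for the actual names.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative id for a long-context RoPE variant: substitute the actual
# repository name from the RoPE-IT2 collection.
rope_model_name = "prajdabre/rotary-indictrans2-indic-en-1B"

tokenizer = AutoTokenizer.from_pretrained(rope_model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    rope_model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # recommended for these models
).to("cuda")
# The rest of the pipeline (IndicProcessor, tokenize, generate) is unchanged.
```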
📚 Documentation
Language Details
| Property | Details |
|----------|---------|
| Supported Languages | as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur |
| Language Details | asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab |
| Tags | indictrans2, translation, ai4bharat, multilingual |
| License | mit |
| Datasets | flores-200, IN22-Gen, IN22-Conv |
| Metrics | bleu, chrf, chrf++, comet |
| Inference | false |
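The `src_lang`/`tgt_lang` values in the usage example are the FLORES-style tags from the Language Details row (ISO code plus script). A minimal lookup sketch, covering only an illustrative subset of the supported languages:

```python
# Illustrative subset of ISO code -> FLORES-style tag; the full mapping
# follows the Supported Languages / Language Details rows above.
ISO_TO_TAG = {
    "as": "asm_Beng",
    "bn": "ben_Beng",
    "hi": "hin_Deva",
    "ta": "tam_Taml",
    "te": "tel_Telu",
    "en": "eng_Latn",
}

src_lang, tgt_lang = ISO_TO_TAG["hi"], ISO_TO_TAG["en"]  # "hin_Deva", "eng_Latn"
```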
📄 License
This project is licensed under the MIT license.
📖 Citation
If you use our work, please cite:
```bibtex
@article{gala2023indictrans,
  title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=vfT4YuzAYA},
  note={}
}
```