🚀 IndicTrans2
This is the model card of IndicTrans2, an Indic-Indic 1B variant adapted by stitching the Indic-En 1B and En-Indic 1B variants. It offers translation across multiple Indic languages, contributing to high-quality and accessible machine translation for all 22 scheduled Indian languages.
✨ Features
- Multilingual Support: Supports a wide range of Indic languages, including `as`, `bn`, `brx`, and more.
- High-Quality Translation: Trained on datasets like `flores-200`, `IN22-Gen`, and `IN22-Conv`, and evaluated using metrics such as `bleu`, `chrf`, `chrf++`, and `comet`.
- AI4Bharat Initiative: Part of the `ai4bharat` project, which aims to promote AI for the Indian subcontinent.
Language Details
| Property | Details |
|---|---|
| Languages | as, bn, brx, doi, gom, gu, hi, kn, ks, mai, ml, mr, mni, ne, or, pa, sa, sat, snd, ta, te, ur |
| Language Codes | asm_Beng, ben_Beng, brx_Deva, doi_Deva, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, mai_Deva, mal_Mlym, mar_Deva, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Deva, tam_Taml, tel_Telu, urd_Arab |
Tags
- indictrans2
- translation
- ai4bharat
- multilingual
Datasets
- flores-200
- IN22-Gen
- IN22-Conv
Metrics
- bleu
- chrf
- chrf++
- comet
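The card itself does not specify an evaluation harness. As a minimal sketch of how BLEU and chrF++ scores can be computed, assuming the `sacrebleu` package (which the original card does not mention):

```python
# Minimal evaluation sketch. Assumption: the sacrebleu package is installed;
# the card does not say which tooling produced its reported scores.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["This is a sample model output."]  # system translations
references = [["This is a sample reference."]]   # one list per reference set

bleu = BLEU()
chrf_pp = CHRF(word_order=2)  # word_order=2 turns chrF into chrF++

print(bleu.corpus_score(hypotheses, references))
print(chrf_pp.corpus_score(hypotheses, references))
```

`comet` is a neural metric that requires downloading a separate scoring checkpoint, so it is omitted from this sketch.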
📦 Installation
The original card does not list installation steps. The usage example below depends on `torch`, `transformers`, and the `IndicTransToolkit` preprocessing package.
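A typical setup might look like the following (an assumption based on the imports in the usage example, not taken from the original card):

```bash
# Assumed setup: PyTorch, Hugging Face Transformers, and IndicTransToolkit.
# Exact package names are assumptions; check each project's own instructions.
pip install torch transformers
pip install IndicTransToolkit
```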
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

src_lang, tgt_lang = "hin_Deva", "tam_Taml"
model_name = "ai4bharat/indictrans2-indic-indic-1B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,  # use torch.float32 when running on CPU
    attn_implementation="flash_attention_2",  # optional; drop if flash-attn is not installed
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।",
    "हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।",
    "अगर तुम मुझे उस समय पास मिलते, तो हम बाहर खाना खाने चलते।",
    "मेरे मित्र ने मुझे उसके जन्मदिन की पार्टी में बुलाया है, और मैं उसे एक तोहफा दूंगा।",
]

# Normalize the batch and tag it with source/target language tokens.
batch = ip.preprocess_batch(
    input_sentences,
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)

# Tokenize the preprocessed sentences and move the tensors to the device.
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations with beam search.
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated token IDs back into text.
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Restore script-specific formatting for the target language.
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
```
Advanced Usage
The original card stops at the basic example. Because the Indic-Indic variant translates between any pair of the 22 supported languages, one natural extension is to reuse the already-loaded model for several target languages, as sketched below.
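A minimal sketch, assuming the objects from the basic example (`model`, `tokenizer`, `ip`, `input_sentences`, `DEVICE`) are still in scope; the helper function and the choice of target languages are illustrative, not an official recipe:

```python
# Illustrative helper (an assumption, not from the original card): wraps the
# preprocess -> tokenize -> generate -> decode -> postprocess pipeline above.
def translate_batch(sentences, src_lang, tgt_lang):
    batch = ip.preprocess_batch(sentences, src_lang=src_lang, tgt_lang=tgt_lang)
    inputs = tokenizer(
        batch,
        truncation=True,
        padding="longest",
        return_tensors="pt",
        return_attention_mask=True,
    ).to(DEVICE)
    with torch.no_grad():
        generated = model.generate(
            **inputs,
            use_cache=True,
            min_length=0,
            max_length=256,
            num_beams=5,
            num_return_sequences=1,
        )
    decoded = tokenizer.batch_decode(
        generated,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return ip.postprocess_batch(decoded, lang=tgt_lang)

# Translate the same Hindi batch into Tamil, Telugu, and Bengali.
for tgt in ["tam_Taml", "tel_Telu", "ben_Beng"]:
    for translation in translate_batch(input_sentences, "hin_Deva", tgt):
        print(f"{tgt}: {translation}")
```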
📚 Documentation
Please refer to the blog for further details on model training, data, and metrics. For a detailed description of how to use HF-compatible IndicTrans2 models for inference, please refer to the GitHub repository.
📄 License
This project is licensed under the MIT license.
📚 Citation
If you use our work, please cite:
```bibtex
@article{gala2023indictrans,
  title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=vfT4YuzAYA},
  note={}
}
```