IndicTrans2 open-source machine translation model: high-quality translation between 22 Indic languages and English

Indictrans2 En Indic Dist 200M

Developed by ai4bharat

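IndicTrans2 is a high-quality machine translation model that supports translation between 22 Indic languages and English. This version is the distilled variant with 200M parameters.

Machine Translation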

Transformers

Multilingual | Open-source license: MIT | #IndicMultilingualTranslation #LowResourceOptimization #DevanagariScriptSupport

Downloads: 4,461

Release date: 9/12/2023

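Model Overview

This model specializes in high-quality machine translation between English and 22 Indic languages, and uses distillation to balance model size against translation quality.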

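Model Features

Multilingual support

Supports translation between 22 Indic languages and English

Efficient distilled model

A 200M-parameter distilled version that reduces model size while maintaining performance

Long-context support

The RoPE variants can process sequences of up to 2048 tokens (only when one of those variants is used)

Support for multiple writing systems

Supports the scripts used by multiple Indic languages (Devanagari, Arabic, and others)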

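Model Capabilities

English-to-Indic translation (the direction covered by this En-Indic checkpoint)

Indic-to-English translation (covered by other checkpoints in the IndicTrans2 family)

Indic-to-Indic translation (covered by other checkpoints in the IndicTrans2 family)

Long-text translation (RoPE variants)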

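Use Cases

Multilingual Content Creation

Multilingual website content translation

Translate English website content into multiple Indic languages

Improve accessibility for users across India

Government Services

Official document translation

Translate government announcements into multiple Indic languages

Facilitate the delivery of administrative information in multilingual regions

Education

Localization of teaching materials

Translate English teaching materials into students' native languages

Improve learning outcomes for non-native English speakers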

🚀 IndicTrans2

This is the model card for the IndicTrans2 En-Indic Distilled 200M variant. The model specializes in multilingual translation and targets a wide range of Indic languages.

🚀 Quick Start

For details on model training, data, and metrics, see Section 7.6: Distilled Models of the TMLR submission.

📦 Installation

Using this model requires the following libraries.

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor
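
Before loading the model, it can help to confirm that the three packages imported above are actually present in the environment. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and is not part of any of these libraries.

```python
import importlib.util

# Top-level module names taken from the import statements above.
REQUIRED_MODULES = ["torch", "transformers", "IndicTransToolkit"]


def missing_modules(names):
    """Return the names in `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only confirms that each module can be located; it does not verify versions or GPU support.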

💻 Usage Examples

Basic Usage

# Recommended: run this on a GPU with flash_attn installed.
# Do not set attn_implementation if flash_attn is not installed.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

src_lang, tgt_lang = "eng_Latn", "hin_Deva"
model_name = "ai4bharat/indictrans2-en-indic-dist-200M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    torch_dtype=torch.float16, # performance might slightly vary for bfloat16
    attn_implementation="flash_attention_2"
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "When I was young, I used to go to the park every day.",
    "We watched a new movie last week, which was very inspiring.",
    "If you had met me at that time, we would have gone out to eat.",
    "My friend has invited me to his birthday party, and I will give him a gift.",
]

batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)

# Tokenize the sentences and generate input encodings
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations using the model
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated tokens into text
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Postprocess the translations, including entity replacement
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
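
The comments at the top of the example say that `attn_implementation` should be left unset when flash_attn is unavailable, yet the example passes it unconditionally. One way to follow that advice is to build the keyword argument conditionally. This is a minimal sketch; the helper name `attention_kwargs` is hypothetical, and the assumption that the flash-attn package is importable as `flash_attn` comes from the comment in the example.

```python
import importlib.util


def attention_kwargs(device, flash_attn_installed=None):
    """Return extra from_pretrained kwargs for the attention backend.

    FlashAttention 2 is requested only on a CUDA device and only when the
    flash_attn package can be found; otherwise the dict is empty so the
    library default is used.
    """
    if flash_attn_installed is None:
        # Assumption: the package is importable under the name "flash_attn".
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    if device.startswith("cuda") and flash_attn_installed:
        return {"attn_implementation": "flash_attention_2"}
    return {}
```

With this helper, the load call could be written as `AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16, **attention_kwargs(DEVICE))`. On a CPU-only machine it may also be worth using `torch.float32` instead of `torch.float16`, since half precision is mainly intended for GPUs.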

📚 Documentation

For details on how to use the HF-compatible IndicTrans2 models for inference, see the GitHub repository.

🔧 Technical Details

Supported Languages

Property	Details
Languages	as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur
Language details	asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab
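
Each language tag listed above pairs a three-letter language code with a four-letter script code (for example `hin_Deva`), and these are the values passed as `src_lang` and `tgt_lang` in the usage example. The sketch below validates a tag against that list and groups the tags by script. The helper names `split_tag` and `languages_by_script` are hypothetical and are not part of IndicTransToolkit.

```python
from collections import defaultdict

# Language tags copied from the table above.
SUPPORTED_TAGS = (
    "asm_Beng ben_Beng brx_Deva doi_Deva eng_Latn gom_Deva guj_Gujr hin_Deva "
    "kan_Knda kas_Arab kas_Deva mai_Deva mal_Mlym mar_Deva mni_Beng mni_Mtei "
    "npi_Deva ory_Orya pan_Guru san_Deva sat_Olck snd_Arab snd_Deva tam_Taml "
    "tel_Telu urd_Arab"
).split()


def split_tag(tag):
    """Split a tag such as 'hin_Deva' into (language, script).

    Raises ValueError if the tag is not in the supported list.
    """
    if tag not in SUPPORTED_TAGS:
        raise ValueError(f"Unsupported language tag: {tag!r}")
    language, script = tag.split("_")
    return language, script


def languages_by_script(tags=SUPPORTED_TAGS):
    """Group language codes by the script they are written in."""
    groups = defaultdict(list)
    for tag in tags:
        language, script = split_tag(tag)
        groups[script].append(language)
    return dict(groups)
```

The list has 26 tags but 23 distinct language codes (22 Indic languages plus English), because Kashmiri, Manipuri, and Sindhi each appear in two scripts.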

Datasets

flores-200
IN22-Gen
IN22-Conv

Metrics

bleu
chrf
chrf++
comet
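
To give an intuition for the character-level metrics in this list, the following is a simplified sketch of a chrF-style score: character n-gram precision and recall are averaged over n = 1..6 and combined into an F-score with beta = 2. It is an illustration only. It ignores details of the reference implementations (such as whitespace handling and the word n-grams that chrF++ adds), so its numbers should not be compared with published scores. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF-style F-score in the range [0, 1]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # One side is too short to have n-grams of this order; skip it.
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```

Because it works on characters rather than words, this kind of score gives partial credit for near-miss word forms, which is one reason character-level metrics are commonly reported for morphologically rich languages.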

📄 License

This model is released under the MIT License.

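📢 Long-Context IT2 Models

New RoPE-based IndicTrans2 models that can handle sequence lengths of up to 2048 tokens are now available.
These models can be used simply by changing the model_name parameter. For more information on generation, read the model card of the RoPE-IT2 models.
For efficient generation, running these models with flash_attention_2 is recommended.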
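
When translating long documents, the input still has to fit within the sequence limit of whichever model is used. The sketch below groups consecutive sentences into chunks that stay under a token budget. The function name `chunk_by_token_budget` is hypothetical, and the default whitespace-based token count is a crude assumption: in practice the model's tokenizer should be used to count tokens, and some headroom should be left for special tokens and language tags.

```python
def chunk_by_token_budget(sentences, max_tokens=2048, count_tokens=None):
    """Group consecutive sentences into chunks of at most `max_tokens` tokens.

    `count_tokens` maps a sentence to its token count. The default counts
    whitespace-separated words, which is only a rough stand-in for a real
    tokenizer.
    """
    if count_tokens is None:
        count_tokens = lambda sentence: len(sentence.split())

    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if n > max_tokens:
            raise ValueError("A single sentence exceeds the token budget.")
        if current and used + n > max_tokens:
            # Adding this sentence would overflow the budget: start a new chunk.
            chunks.append(current)
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(current)
    return chunks
```

For example, with the tokenizer from the usage section, `count_tokens=lambda s: len(tokenizer(s)["input_ids"])` would give a closer estimate than the whitespace default.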

Citation

If you use this model, please cite it as follows.

@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}