IndicTrans2: an open-source multilingual machine translation model for free translation between English and 22 Indian languages

Indictrans2 En Indic 1B

Developed by ai4bharat

IndicTrans2 is a high-quality multilingual machine translation model that supports translation between English and 22 Indian languages.

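Machine translation

Transformers

Multilingual, open source, license: MIT. Tags: #Indian multilingual translation #22 Indian languages #High-accuracy machine translation

Downloads: 106.30k

Release date: 9/9/2023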

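Model overview

This model specializes in high-quality machine translation between English and 22 Indian languages. It uses a 1.1B-parameter architecture and supports multiple Indic scripts.

Model features

Multilingual support

Supports translation between English and 22 Indian languages, covering multiple scripts.

High-quality translation

Uses a 1.1B-parameter model to deliver high-quality translations.

Long-context support

The RoPE variant models can process sequences of up to 2048 tokens.

Multi-script handling

Handles multiple scripts used for Indian languages, including Devanagari, Bengali, and Arabic.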

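Model capabilities

Translation from English into Indian languages

Translation from Indian languages into English

Multilingual machine translation

Long-text translation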

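Use cases

Cross-language communication

Government document translation

Translate government documents between English and regional Indian languages.

Improves access to government information in multilingual settings.

Educational content localization

Translate educational materials into different Indian languages.

Promotes the spread of educational resources across language groups.

Business applications

Multilingual customer support

Translate customer service content into multiple languages for businesses.

Extends a company's reach in multilingual markets.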

🚀 IndicTrans2

This is the model card for the IndicTrans2 En-Indic 1.1B variant.

Metrics for this particular checkpoint are available here.

For details on model training, intended use, data, metrics, limitations, and recommendations, see Appendix D: Model Card of the preprint.

🚀 Quick start

Supported languages

Field	Details
Supported languages	as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur
Language details	asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab
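
The entries in the "Language details" row appear to follow the FLORES-200 convention of a three-letter language code joined to a four-letter script code (for example, hin_Deva for Hindi in Devanagari). As a small illustration, the sketch below splits each tag and groups languages by script. The helper names split_tag and group_by_script are hypothetical and are not part of IndicTransToolkit.

```python
from collections import defaultdict

# Language tags copied from the "Language details" row above.
LANGUAGE_TAGS = [
    "asm_Beng", "ben_Beng", "brx_Deva", "doi_Deva", "eng_Latn", "gom_Deva",
    "guj_Gujr", "hin_Deva", "kan_Knda", "kas_Arab", "kas_Deva", "mai_Deva",
    "mal_Mlym", "mar_Deva", "mni_Beng", "mni_Mtei", "npi_Deva", "ory_Orya",
    "pan_Guru", "san_Deva", "sat_Olck", "snd_Arab", "snd_Deva", "tam_Taml",
    "tel_Telu", "urd_Arab",
]


def split_tag(tag: str) -> tuple[str, str]:
    """Split a tag such as 'hin_Deva' into (language, script)."""
    lang, script = tag.split("_", 1)
    return lang, script


def group_by_script(tags: list[str]) -> dict[str, list[str]]:
    """Map each script code to the language codes written in it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for tag in tags:
        lang, script = split_tag(tag)
        groups[script].append(lang)
    return dict(groups)


scripts = group_by_script(LANGUAGE_TAGS)
for script, langs in sorted(scripts.items()):
    print(script, langs)
```

Grouping this way makes it easy to see that some languages (kas, mni, snd) are listed under more than one script, so the full tag, not just the language code, has to be passed as src_lang or tgt_lang.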

License

MIT

Datasets

flores-200
IN22-Gen
IN22-Conv

Evaluation metrics

bleu
chrf
chrf++
comet

Inference

false

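✨ Key features

This model is the IndicTrans2 En-Indic 1.1B variant; metrics are provided for this particular checkpoint.
Details on model training, intended use, data, metrics, limitations, and recommendations are in Appendix D: Model Card of the preprint.
The new RoPE-based IndicTrans2 models can handle sequence lengths of up to 2048 tokens.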
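
When working with long documents, one possible way to respect a sequence-length limit such as 2048 tokens is to pack consecutive sentences into chunks that stay under a token budget. The sketch below is only an illustration: pack_sentences is a hypothetical helper, and the default whitespace word count is a stand-in for the model's real tokenizer, which will generally produce more tokens per sentence.

```python
from typing import Callable, List


def pack_sentences(
    sentences: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[List[str]]:
    """Greedily group consecutive sentences so each group stays within max_tokens.

    A single sentence longer than max_tokens still gets its own group;
    the caller has to decide whether to truncate or split it further.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and used + n > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(current)
    return chunks


demo = pack_sentences(["a b", "c d e", "f"], max_tokens=4)
print(demo)
```

In practice count_tokens could wrap the tokenizer, for example by returning the length of the input_ids it produces for one sentence.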

💻 Usage examples

Basic usage

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor
# Recommended: run this on a GPU with flash_attn installed.
# Don't set attn_implementation if you don't have flash_attn.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

src_lang, tgt_lang = "eng_Latn", "hin_Deva"
model_name = "ai4bharat/indictrans2-en-indic-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    torch_dtype=torch.float16, # performance might slightly vary for bfloat16
    attn_implementation="flash_attention_2"
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "When I was young, I used to go to the park every day.",
    "We watched a new movie last week, which was very inspiring.",
    "If you had met me at that time, we would have gone out to eat.",
    "My friend has invited me to his birthday party, and I will give him a gift.",
]

batch = ip.preprocess_batch(
    input_sentences,
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)

# Tokenize the sentences and generate input encodings
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations using the model
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated tokens into text
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Postprocess the translations, including entity replacement
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
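
The example above translates four sentences in a single batch. For larger inputs, a common pattern is to process the sentences in fixed-size batches so that memory use stays bounded. The sketch below shows only the batching logic: chunked and translate_all are hypothetical helpers, and translate_batch is assumed to be a function you write that wraps the preprocess, tokenize, generate, decode, and postprocess steps shown above. A stand-in that upper-cases text is used here so the sketch runs without a model.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_all(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 4,
) -> List[str]:
    """Translate sentences batch by batch, preserving the input order."""
    translations: List[str] = []
    for batch in chunked(sentences, batch_size):
        translations.extend(translate_batch(batch))
    return translations


def fake_translate_batch(batch: List[str]) -> List[str]:
    """Stand-in for the real pipeline; it only upper-cases the input."""
    return [sentence.upper() for sentence in batch]


result = translate_all(["a", "b", "c", "d", "e"], fake_translate_batch, batch_size=2)
print(result)
```

A suitable batch_size depends on the available GPU memory and on sentence length, so it is worth tuning rather than treating 4 as a recommendation.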

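Advanced usage

New RoPE-based IndicTrans2 models are available here. These models can handle sequence lengths of up to 2048 tokens.
To use them, just change the model_name parameter. For more information on generation, see the model card of the RoPE-IT2 model.
We recommend running these models with flash_attention_2 for efficient generation.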
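
Because flash_attention_2 needs both a CUDA device and the optional flash_attn package, it can help to request it only when both are present. The sketch below is one way to do that under those assumptions; attention_kwargs is a hypothetical helper, not part of transformers or IndicTransToolkit.

```python
import importlib.util


def attention_kwargs(cuda_available: bool) -> dict:
    """Return extra from_pretrained keyword arguments for the attention backend."""
    # flash_attn is optional; find_spec returns None when it is not installed.
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return {"attn_implementation": "flash_attention_2"}
    # Otherwise pass nothing and let transformers pick its default attention.
    return {}


print(attention_kwargs(cuda_available=False))  # prints {}
```

The returned dictionary could then be unpacked into the loading call from the basic example, for instance AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True, **attention_kwargs(torch.cuda.is_available())).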

📚 Documentation

See the GitHub repository for a detailed description of how to use the model.

📄 License

This model is released under the MIT license.

Citation

If you use this model, please cite it as follows.

@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}