IndicTrans2: an open-source Indic-to-Indic translation model with free support for translation among the 22 official languages of India

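IndicTrans2 Indic-Indic Dist 320M

Developed by ai4bharat

IndicTrans2 Indic-Indic is a high-quality machine translation model that translates between the 22 official languages of India. This page covers the distilled variant with 320M parameters.

Machine translation

Transformers

Open-source license: MIT #Indic-Indic translation #Multilingual translation #High-accuracy translation

Downloads: 4,254

Release date: 11/28/2023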

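Model overview

This model is specialized for translation between the 22 official languages of India. It is a distilled variant: distillation reduces the model size while aiming to preserve translation quality.

Model features

Multilingual support

Supports translation between the 22 official languages of India

High-quality translation

Uses distillation with the aim of preserving translation quality in a smaller model

Efficient inference

Supports faster inference with flash_attention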
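
The usage example further down this page always passes attn_implementation="flash_attention_2", while its own comments advise leaving that argument out when flash_attn is not installed. A minimal sketch of making that choice at runtime follows. The helper names flash_attn_available and attention_kwargs are hypothetical and are not part of Transformers or IndicTransToolkit; the sketch also assumes flash attention is only wanted when a GPU is in use.

```python
import importlib.util


def flash_attn_available() -> bool:
    """Return True if the flash_attn package can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def attention_kwargs(on_gpu: bool, has_flash_attn: bool) -> dict:
    """Build the optional attention argument for from_pretrained.

    Hypothetical helper: returns the flash attention setting only when
    both a GPU and the flash_attn package are available, else nothing.
    """
    if on_gpu and has_flash_attn:
        return {"attn_implementation": "flash_attention_2"}
    return {}
```

With this in place, the model could be loaded with AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True, **attention_kwargs(DEVICE == "cuda", flash_attn_available())), so the same script runs with or without flash_attn.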

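Model capabilities

Text translation

Multilingual translation

Cross-lingual conversion

Use cases

Cross-lingual communication

Government document translation

Translate government documents between different Indian languages

News content localization

Translate news content into regional languages

Education

Teaching material translation

Translate educational materials into different language versions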

🚀 IndicTrans2

This is the model card for the IndicTrans2 Indic-Indic Distilled 320M variant, which was adapted after stitching together the Indic-En Distilled 200M and En-Indic Distilled 200M variants.

For further details on model training, data, and metrics, please refer to the blog.

✨ Key features

Supported languages
- as, bn, brx, doi, gom, gu, hi, kn, ks, mai, ml, mr, mni, ne, or, pa, sa, sat, snd, ta, te, ur
- Language details: asm_Beng, ben_Beng, brx_Deva, doi_Deva, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, mai_Deva, mal_Mlym, mar_Deva, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Deva, tam_Taml, tel_Telu, urd_Arab
Tags
- indictrans2, translation, ai4bharat, multilingual
License: MIT
Datasets
- flores-200, IN22-Gen, IN22-Conv
Evaluation metrics
- bleu, chrf, chrf++, comet
Inference: disabled

Property	Details
Model type	IndicTrans2 Indic-Indic Distilled 320M variant
Datasets	flores-200, IN22-Gen, IN22-Conv
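
The language details listed above are the tags passed as src_lang and tgt_lang in the usage example, each made of a language code and a script code joined by an underscore. The following is a minimal sketch of checking tags before running a translation. SUPPORTED_TAGS is copied from the list above, and the helpers split_tag and check_tags are hypothetical names introduced here for illustration, not part of IndicTransToolkit.

```python
# Tags copied from the "Language details" list on this page.
SUPPORTED_TAGS = (
    "asm_Beng", "ben_Beng", "brx_Deva", "doi_Deva", "gom_Deva", "guj_Gujr",
    "hin_Deva", "kan_Knda", "kas_Arab", "mai_Deva", "mal_Mlym", "mar_Deva",
    "mni_Mtei", "npi_Deva", "ory_Orya", "pan_Guru", "san_Deva", "sat_Olck",
    "snd_Deva", "tam_Taml", "tel_Telu", "urd_Arab",
)


def split_tag(tag: str) -> tuple[str, str]:
    """Split a tag such as 'hin_Deva' into (language code, script code)."""
    lang, script = tag.split("_", 1)
    return lang, script


def check_tags(src_lang: str, tgt_lang: str) -> None:
    """Raise ValueError if either tag is missing from SUPPORTED_TAGS."""
    for tag in (src_lang, tgt_lang):
        if tag not in SUPPORTED_TAGS:
            raise ValueError(f"unsupported language tag: {tag}")
```

Calling check_tags("hin_Deva", "tam_Taml") before preprocessing would catch a mistyped tag early, rather than partway through the translation pipeline.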

💻 Usage examples

Basic usage

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor
# Recommended: run this on a GPU with flash_attn installed.
# Don't set attn_implementation if you don't have flash_attn.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

src_lang, tgt_lang = "hin_Deva", "tam_Taml"
model_name = "ai4bharat/indictrans2-indic-indic-dist-320M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,  # performance might slightly vary for bfloat16
    attn_implementation="flash_attention_2",  # remove this argument if flash_attn is not installed
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।",
    "हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।",
    "अगर तुम मुझे उस समय पास मिलते, तो हम बाहर खाना खाने चलते।",
    "मेरे मित्र ने मुझे उसके जन्मदिन की पार्टी में बुलाया है, और मैं उसे एक तोहफा दूंगा।",
]

batch = ip.preprocess_batch(
    input_sentences,
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)

# Tokenize the sentences and generate input encodings
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations using the model
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated tokens into text
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Postprocess the translations, including entity replacement
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
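
The example above translates four sentences in a single batch. For longer inputs, one option is to split the sentences into fixed-size batches and run each batch through the same preprocess, tokenize, generate, decode, and postprocess steps. The sketch below shows only the splitting step. The helper name iter_batches and the default batch size of 4 are assumptions made for illustration, not values recommended by the model authors.

```python
from typing import Iterator, List, Sequence


def iter_batches(sentences: Sequence[str], batch_size: int = 4) -> Iterator[List[str]]:
    """Yield consecutive slices of `sentences`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sentences), batch_size):
        yield list(sentences[start:start + batch_size])
```

Each yielded list can be passed to ip.preprocess_batch in place of input_sentences. A suitable batch size depends on available memory and sentence length.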

📚 Documentation

For a detailed description of how to use the HF-compatible IndicTrans2 models for inference, please refer to the GitHub repository.

📄 License

This model is released under the MIT license.

📖 Citation

If you use this model, please cite it as follows.

@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}