IndicTrans2 Indic-Indic 1B: an open-source translation model with free support for translation between 22 Indian languages

IndicTrans2 Indic-Indic 1B

Developed by AI4Bharat

A 1B-parameter model that translates between 22 Indian languages. It was built by combining and adapting the Indic-En and En-Indic models.

Machine translation

Transformers

Open-source license: MIT #Indic multilingual translation #1B-parameter large model #22 Indian languages

Downloads: 1,542

Release date: 11/28/2023

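Model overview

This model focuses on high-quality machine translation between the 22 scheduled languages of India and supports translation across multiple writing systems.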

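Model features

Multilingual support

Translates between the 22 scheduled languages of India and covers multiple writing systems

Large-scale model

Uses a 1B-parameter model to deliver higher translation quality

Writing-system handling

Handles translation between languages written in different scripts, such as Devanagari, Bengali, and Tamil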

Model capabilities

Translation between Indian languages

Multi-script processing

Batch translation

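Use cases

Cross-language communication

Government document translation

Translates government documents between Indian languages

Makes government information more accessible across language communities

Localization of educational materials

Translates educational materials into the local language of each region

Promotes equitable access to educational resources

Business applications

Multilingual customer support

Provides support content for users who speak different Indian languages

Improves customer satisfaction and market coverage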

🚀 IndicTrans2

This is the model card for the IndicTrans2 Indic-Indic 1B variant, built by combining and adapting the Indic-En 1B and En-Indic 1B variants.

For details on model training, data, and metrics, see the blog.

✨ Key features

Supported languages: as, bn, brx, doi, gom, gu, hi, kn, ks, mai, ml, mr, mni, ne, or, pa, sa, sat, snd, ta, te, ur
Language details: asm_Beng, ben_Beng, brx_Deva, doi_Deva, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, mai_Deva, mal_Mlym, mar_Deva, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Deva, tam_Taml, tel_Telu, urd_Arab
Tags: indictrans2, translation, ai4bharat, multilingual
License: MIT
Datasets: flores-200, IN22-Gen, IN22-Conv
Evaluation metrics: bleu, chrf, chrf++, comet
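
The usage example passes script-tagged codes such as hin_Deva as src_lang and tgt_lang, so it can help to validate a language pair before loading the model. The sketch below is one possible way to do that. It assumes the short codes and the script-tagged codes listed above are aligned position by position, which they appear to be; the helper names resolve_lang and resolve_pair are hypothetical and not part of any library.

```python
# Short codes and script-tagged codes copied from the two lists above.
SHORT_CODES = (
    "as bn brx doi gom gu hi kn ks mai ml mr mni ne or pa sa sat snd ta te ur"
).split()
TAGGED_CODES = (
    "asm_Beng ben_Beng brx_Deva doi_Deva gom_Deva guj_Gujr hin_Deva kan_Knda "
    "kas_Arab mai_Deva mal_Mlym mar_Deva mni_Mtei npi_Deva ory_Orya pan_Guru "
    "san_Deva sat_Olck snd_Deva tam_Taml tel_Telu urd_Arab"
).split()

# Assumption: the two lists are aligned position by position.
SHORT_TO_TAGGED = dict(zip(SHORT_CODES, TAGGED_CODES))


def resolve_lang(code: str) -> str:
    """Return the script-tagged code for a short or already-tagged code."""
    if code in TAGGED_CODES:
        return code
    if code in SHORT_TO_TAGGED:
        return SHORT_TO_TAGGED[code]
    raise ValueError(f"unsupported language code: {code!r}")


def resolve_pair(src: str, tgt: str) -> tuple[str, str]:
    """Validate a translation direction and return (src_lang, tgt_lang)."""
    src_lang, tgt_lang = resolve_lang(src), resolve_lang(tgt)
    if src_lang == tgt_lang:
        raise ValueError("source and target languages must differ")
    return src_lang, tgt_lang
```

For example, resolve_pair("hi", "ta") yields the pair used in the usage example, while a code outside the list (such as "en", which this Indic-Indic variant does not list) raises ValueError.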

📦 Installation

For detailed installation instructions, see the GitHub repository.

💻 Usage examples

Basic usage

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor
# Recommended: run this on a GPU with flash_attn installed.
# Omit attn_implementation below if flash_attn is not installed.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

src_lang, tgt_lang = "hin_Deva", "tam_Taml"
model_name = "ai4bharat/indictrans2-indic-indic-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    torch_dtype=torch.float16, # performance might slightly vary for bfloat16
    attn_implementation="flash_attention_2"
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।",
    "हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।",
    "अगर तुम मुझे उस समय पास मिलते, तो हम बाहर खाना खाने चलते।",
    "मेरे मित्र ने मुझे उसके जन्मदिन की पार्टी में बुलाया है, और मैं उसे एक तोहफा दूंगा।",
]

batch = ip.preprocess_batch(
    input_sentences,
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)

# Tokenize the sentences and generate input encodings
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations using the model
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated tokens into text
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Postprocess the translations, including entity replacement
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
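
The example above translates four sentences in a single batch. For longer inputs, padding to the longest sentence and beam search can make one large batch expensive, so a common approach is to split the input into fixed-size chunks. The following is a minimal sketch of that idea, not part of the model card's code: translate_batch is a hypothetical callable that would wrap the preprocess, tokenize, generate, decode, and postprocess steps shown above, and the default batch_size of 8 is an arbitrary assumption.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_in_batches(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],  # hypothetical wrapper
    batch_size: int = 8,  # arbitrary default; tune to available memory
) -> List[str]:
    """Translate `sentences` chunk by chunk, preserving the input order."""
    translations: List[str] = []
    for batch in chunked(sentences, batch_size):
        outputs = translate_batch(batch)
        # Guard against a wrapper that drops or duplicates sentences.
        if len(outputs) != len(batch):
            raise RuntimeError("translate_batch returned a different number of items")
        translations.extend(outputs)
    return translations
```

Because the helper only depends on the callable, it can be exercised with a stand-in function (for instance, one that upper-cases its input) before wiring in the real model.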

📄 License

This model is released under the MIT License.

📚 Documentation

Citation

If you use this model, please cite it as follows.

@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}