IndicTrans2: an open-source machine translation model for high-quality translation between 22 Indian languages and English

Indictrans2 En Indic Dist 200M

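Developed by ai4bharat

IndicTrans2 is a high-quality machine translation model covering 22 Indian languages and English; this release is the distilled 200M-parameter version.

Machine Translation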

Transformers

Multilingual | License: MIT | #IndicMultilingualTranslation #LowResourceOptimized #DevanagariSupport

Downloads: 4,461

Released: 9/12/2023

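Model overview

This model focuses on high-quality machine translation between English and 22 Indian languages, and uses distillation to balance model size against performance.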

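Model features

Multilingual support

Supports translation between 22 Indian languages and English

Efficient distilled model

A distilled 200M-parameter version that reduces model size while preserving performance

Long-context support

The RoPE variant can handle sequences of up to 2048 tokens (this requires the separate RoPE-based checkpoint)

Multiple writing systems

Supports the scripts used by the covered Indian languages (for example Devanagari and Arabic script)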

Model capabilities

English-to-Indic translation

Indic-to-English translation

Indic-to-Indic translation

Long-text translation (RoPE variant)

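Use cases

Multilingual content creation

Multilingual website translation

Translate English website content into multiple Indian languages

Improves accessibility for users in India

Government services

Official document translation

Translate government announcements into multiple Indian languages

Facilitates the communication of government information in multilingual regions

Education

Localization of teaching materials

Translate English teaching materials into students' native languages

Improves learning outcomes for students who are not native English speakers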

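🚀 IndicTrans2

IndicTrans2 is a model for English-to-Indic translation. Its distilled 200M-parameter version provides efficient, accurate translation into multiple Indian languages and is intended to support communication across Indian-language communities.

🚀 Quick start

For more details on model training, data, and evaluation metrics, see Section 7.6 (Distilled Models) of the TMLR submission.

For instructions on running inference with the Hugging Face-compatible IndicTrans2 models, see the detailed description in the GitHub repository.

💻 Usage examples

Basic usage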

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor
# Recommended: run this on a GPU with flash_attn installed.
# Do not set attn_implementation if you do not have flash_attn.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

src_lang, tgt_lang = "eng_Latn", "hin_Deva"
model_name = "ai4bharat/indictrans2-en-indic-dist-200M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    torch_dtype=torch.float16, # performance might slightly vary for bfloat16
    attn_implementation="flash_attention_2"
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "When I was young, I used to go to the park every day.",
    "We watched a new movie last week, which was very inspiring.",
    "If you had met me at that time, we would have gone out to eat.",
    "My friend has invited me to his birthday party, and I will give him a gift.",
]

batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)

# Tokenize the sentences and generate input encodings
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations using the model
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated tokens into text
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Postprocess the translations, including entity replacement
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
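
The example above translates four sentences in a single call. For larger inputs, one option is to feed the same pipeline in fixed-size batches so that memory use stays bounded. The sketch below is a minimal illustration and not part of IndicTransToolkit or Transformers: `iter_batches`, `translate_in_batches`, and `fake_translate` are hypothetical names, and `fake_translate` only stands in for a function that wraps the preprocess, tokenize, generate, decode, and postprocess steps shown above.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_in_batches(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 4,
) -> List[str]:
    """Apply `translate_batch` to each batch and concatenate the results in order."""
    translations: List[str] = []
    for batch in iter_batches(sentences, batch_size):
        outputs = translate_batch(batch)
        if len(outputs) != len(batch):
            raise RuntimeError("translate_batch must return one output per input")
        translations.extend(outputs)
    return translations


# Hypothetical stand-in so the sketch runs without downloading the model.
def fake_translate(batch: List[str]) -> List[str]:
    return [sentence.upper() for sentence in batch]


result = translate_in_batches(["a", "b", "c", "d", "e"], fake_translate, batch_size=2)
print(result)
```

In practice, `translate_batch` would be a small function that closes over `ip`, `tokenizer`, and `model`; a suitable batch size depends on sentence length and available GPU memory.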

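Advanced usage

Long-context IT2 models:
- New RoPE-based IndicTrans2 models that can handle sequences of up to 2048 tokens are available.
- To use them, just change the model_name parameter. For more information on generation, read the model card of the RoPE-IT2 models.
- Running these models with flash_attention_2 is recommended for efficient generation.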
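
When a document is longer than the 2048-token limit quoted for the RoPE-based models, it has to be split before translation. The following is a rough sketch of one way to do that, by greedily packing sentences into chunks under a token budget. `pack_sentences` and `whitespace_count` are hypothetical helpers; the whitespace counter is only a placeholder, and with the real tokenizer you would count tokens from its output instead.

```python
from typing import Callable, List, Sequence

MAX_TOKENS = 2048  # sequence limit quoted for the RoPE-based models


def pack_sentences(
    sentences: Sequence[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = MAX_TOKENS,
) -> List[str]:
    """Greedily join sentences into chunks whose summed token counts fit the budget."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if n > max_tokens:
            raise ValueError("a single sentence exceeds the token budget")
        # Close the current chunk if adding this sentence would overflow it.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def whitespace_count(text: str) -> int:
    """Placeholder token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


demo = pack_sentences(["one two three", "four five", "six"], whitespace_count, max_tokens=5)
print(demo)
```

Summing per-sentence counts is only an approximation, since a real tokenizer may add special tokens or language tags to each sequence, so leaving some headroom below the limit is a reasonable precaution.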

📚 Detailed documentation

Supported languages

Property	Details
Supported language codes	as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur
Language details	asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab
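
The tags in the table have the form language_Script, and they are the values passed as `src_lang` and `tgt_lang` in the usage example. As a small sketch (the helper names `split_tag` and `check_tag` are hypothetical, not part of any library), the list can be used to validate a tag before calling the model and to see which languages are offered in more than one script:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Tags copied from the "Language details" row of the table.
SUPPORTED_TAGS = [
    "asm_Beng", "ben_Beng", "brx_Deva", "doi_Deva", "eng_Latn", "gom_Deva",
    "guj_Gujr", "hin_Deva", "kan_Knda", "kas_Arab", "kas_Deva", "mai_Deva",
    "mal_Mlym", "mar_Deva", "mni_Beng", "mni_Mtei", "npi_Deva", "ory_Orya",
    "pan_Guru", "san_Deva", "sat_Olck", "snd_Arab", "snd_Deva", "tam_Taml",
    "tel_Telu", "urd_Arab",
]


def split_tag(tag: str) -> Tuple[str, str]:
    """Split a tag such as 'hin_Deva' into (language, script)."""
    language, script = tag.split("_")
    return language, script


def check_tag(tag: str) -> str:
    """Return the tag unchanged if it is in the table, otherwise raise ValueError."""
    if tag not in SUPPORTED_TAGS:
        raise ValueError(f"unsupported language tag: {tag}")
    return tag


# Group scripts by language to find languages listed with several scripts.
scripts_by_language: Dict[str, List[str]] = defaultdict(list)
for tag in SUPPORTED_TAGS:
    language, script = split_tag(tag)
    scripts_by_language[language].append(script)

multi_script = {lang: s for lang, s in scripts_by_language.items() if len(s) > 1}
print(len(SUPPORTED_TAGS), len(scripts_by_language))
print(multi_script)
```

The 26 tags cover 23 distinct languages (English plus the 22 Indian languages), because kas, mni, and snd each appear with two scripts.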

License

This project is released under the MIT license.

Datasets

flores-200
IN22-Gen
IN22-Conv

Evaluation metrics

bleu
chrf
chrf++
comet
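
To give a feel for what a character-level metric such as chrF measures, here is a deliberately simplified sketch: it averages character n-gram precision and recall over several n-gram orders and combines them into an F-score that weights recall more heavily. `simple_chrf` is a hypothetical name, the choices `max_n=6` and `beta=2.0` are illustrative assumptions, and this is not the implementation behind the reported scores; published numbers should come from an established evaluation toolkit.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of `text` after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score in [0, 1] (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(simple_chrf("abc", "abc"))
print(simple_chrf("abcd", "abce"))
```

Because the score works on characters rather than words, it gives partial credit for near-matches, which is one reason character-level metrics are often used for morphologically rich languages.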

Inference settings

Inference is disabled.

📖 Citation

If you use our work, please cite it as follows:

@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}