🚀 IndicTrans2
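IndicTrans2 is a machine translation model for English-to-Indic translation. This distilled version has 200M parameters and is intended to provide efficient, accurate translation from English into multiple Indic languages.
🚀 Quick Start
Please refer to Section 7.6: Distilled Models of the TMLR submission for further details on model training, data, and evaluation metrics.
For instructions on running inference with the Hugging Face-compatible IndicTrans2 models, see the detailed description in the GitHub repository.
💻 Usage Examples
Basic Usage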
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Source and target languages, given as language_Script tags
src_lang, tgt_lang = "eng_Latn", "hin_Deva"
model_name = "ai4bharat/indictrans2-en-indic-dist-200M"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# float16 with flash_attention_2 assumes a CUDA GPU with flash-attn installed
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "When I was young, I used to go to the park every day.",
    "We watched a new movie last week, which was very inspiring.",
    "If you had met me at that time, we would have gone out to eat.",
    "My friend has invited me to his birthday party, and I will give him a gift.",
]

# Preprocess the sentences for the chosen language pair
batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)

# Tokenize the batch and move the tensors to the target device
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations with beam search
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated token IDs into text
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Postprocess the decoded text for the target language
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
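For inputs larger than a handful of sentences, it can help to translate in fixed-size batches so that padding and memory use stay bounded. The following is a minimal sketch, not part of the IndicTrans2 toolkit: `chunked` and `translate_in_batches` are hypothetical helper names, and `translate_batch` stands for any callable that wraps the preprocess, generate, and postprocess steps of the example above.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_in_batches(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 4,
) -> List[str]:
    """Apply `translate_batch` to each chunk and return results in input order.

    `translate_batch` is a hypothetical callable supplied by the caller; it
    must return exactly one translation per input sentence.
    """
    translations: List[str] = []
    for batch in chunked(sentences, batch_size):
        outputs = translate_batch(batch)
        if len(outputs) != len(batch):
            raise RuntimeError("translate_batch must return one output per input")
        translations.extend(outputs)
    return translations
```

The default `batch_size` of 4 is an arbitrary illustration; a suitable value depends on sentence length and available memory.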
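Advanced Usage
- Long-context IT2 models:
  - New RoPE-based IndicTrans2 models, which can handle sequences of up to 2048 tokens, are available here.
  - To use these models, simply change the model_name parameter. For more information on generation, see the model card of the RoPE-IT2 models.
  - Running these models with flash_attention_2 is recommended for efficient generation.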
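Because `flash_attention_2` requires a CUDA GPU and the `flash_attn` package, it may be convenient to choose the loading arguments at runtime. The sketch below is one possible approach under stated assumptions: `flash_attn_installed` and `select_load_kwargs` are hypothetical helpers, the `"eager"` fallback assumes the checkpoint's remote code supports that attention path, and passing the dtype as the string `"float16"` assumes a recent `transformers` version (otherwise pass `torch.float16`).

```python
import importlib.util
from typing import Any, Dict


def flash_attn_installed() -> bool:
    """Return True if the `flash_attn` package can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def select_load_kwargs(cuda_available: bool, flash_available: bool) -> Dict[str, Any]:
    """Build keyword arguments for `from_pretrained` from the available hardware."""
    kwargs: Dict[str, Any] = {"trust_remote_code": True}
    if cuda_available and flash_available:
        # FlashAttention-2 needs a CUDA GPU and half-precision weights.
        # Assumption: the dtype may be given by name; otherwise use torch.float16.
        kwargs["torch_dtype"] = "float16"
        kwargs["attn_implementation"] = "flash_attention_2"
    else:
        # Assumption: the model's remote code accepts the "eager" attention path.
        kwargs["attn_implementation"] = "eager"
    return kwargs
```

A call might then look like `AutoModelForSeq2SeqLM.from_pretrained(model_name, **select_load_kwargs(torch.cuda.is_available(), flash_attn_installed()))`, with the fallback leaving the dtype at the library default.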
📚 Detailed Documentation
Supported Languages
| Property | Details |
| --- | --- |
| Supported language codes | as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur |
| Language details | asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab |
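The language tags in the table follow a language_Script pattern, so a language pair can be checked before any text is sent to the model. The sketch below is illustrative only: `split_tag` and `check_language_pair` are hypothetical helpers, and the rule that the source must be `eng_Latn` and the target must be a different tag is an assumption based on this being an English-to-Indic model.

```python
from typing import Tuple

# Tags copied from the supported-languages table above.
SUPPORTED_TAGS = frozenset({
    "asm_Beng", "ben_Beng", "brx_Deva", "doi_Deva", "eng_Latn", "gom_Deva",
    "guj_Gujr", "hin_Deva", "kan_Knda", "kas_Arab", "kas_Deva", "mai_Deva",
    "mal_Mlym", "mar_Deva", "mni_Beng", "mni_Mtei", "npi_Deva", "ory_Orya",
    "pan_Guru", "san_Deva", "sat_Olck", "snd_Arab", "snd_Deva", "tam_Taml",
    "tel_Telu", "urd_Arab",
})

SOURCE_TAG = "eng_Latn"  # assumption: this checkpoint translates from English only


def split_tag(tag: str) -> Tuple[str, str]:
    """Split a tag such as 'hin_Deva' into its language and script parts."""
    language, script = tag.split("_", 1)
    return language, script


def check_language_pair(src_lang: str, tgt_lang: str) -> None:
    """Raise ValueError if the pair does not fit an English-to-Indic model."""
    if src_lang != SOURCE_TAG:
        raise ValueError(f"source must be {SOURCE_TAG}, got {src_lang!r}")
    if tgt_lang not in SUPPORTED_TAGS or tgt_lang == SOURCE_TAG:
        raise ValueError(f"unsupported target language tag: {tgt_lang!r}")
```

Calling `check_language_pair(src_lang, tgt_lang)` before preprocessing would surface a mistyped tag early instead of during generation.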
Related Tags
- indictrans2
- translation
- ai4bharat
- multilingual
License
This project is released under the MIT License.
Datasets
- flores-200
- IN22-Gen
- IN22-Conv
Evaluation Metrics
Inference Settings
Inference is disabled.
📖 Citation
If you consider using our work, please cite it as follows:
@article{gala2023indictrans,
  title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=vfT4YuzAYA},
  note={}
}