🚀 IndicTrans2
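IndicTrans2 is a machine translation model for translating from English into Indic languages. This 200M-parameter distilled variant is designed to provide efficient, accurate translation into the Indian languages listed under Supported Languages, making content accessible across many Indian languages.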
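🚀 Quick Start
For details on model training, data, and evaluation metrics, see Section 7.6 (Distilled Models) of the TMLR submission.
For instructions on running inference with the Hugging Face compatible IndicTrans2 models, see the detailed description in the GitHub repository.
💻 Usage Examples
Basic Usage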
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Language tags use the <language>_<script> format listed under Supported Languages.
src_lang, tgt_lang = "eng_Latn", "hin_Deva"
model_name = "ai4bharat/indictrans2-en-indic-dist-200M"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# float16 and flash_attention_2 assume a CUDA GPU with the flash-attn package
# installed; drop these two arguments to run on CPU.
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "When I was young, I used to go to the park every day.",
    "We watched a new movie last week, which was very inspiring.",
    "If you had met me at that time, we would have gone out to eat.",
    "My friend has invited me to his birthday party, and I will give him a gift.",
]

# Normalize the text and attach the source and target language tags.
batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)

# Tokenize the batch, padding every sentence to the longest one.
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations with beam search; gradients are not needed for inference.
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated token IDs back into text.
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Post-process the decoded text for the target language.
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
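The example above translates four sentences in a single call. For larger inputs, one common approach is to run the same steps over fixed-size batches so that padding and memory use stay bounded. The sketch below shows only that batching logic. The names `chunked`, `translate_in_batches`, and the `translate_batch` callback are hypothetical and are not part of IndicTransToolkit or transformers; the callback stands in for the preprocess, generate, decode, and postprocess steps of the basic example, and the default batch size is an arbitrary assumption.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_in_batches(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # assumed default; tune to the available memory
) -> List[str]:
    """Translate `sentences` batch by batch, preserving the input order.

    `translate_batch` is a hypothetical callback that takes a list of source
    sentences and returns one translation per sentence.
    """
    translations: List[str] = []
    for batch in chunked(sentences, batch_size):
        outputs = translate_batch(batch)
        if len(outputs) != len(batch):
            raise RuntimeError("translate_batch must return one output per input")
        translations.extend(outputs)
    return translations
```

In practice, `translate_batch` would wrap the calls to `ip.preprocess_batch`, `tokenizer`, `model.generate`, `tokenizer.batch_decode`, and `ip.postprocess_batch` from the basic example.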
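Advanced Usage
- Long-context IT2 models:
  - The new RoPE-based IndicTrans2 models can handle sequences of up to 2048 tokens and are available here.
  - To use these models, simply change the model_name parameter. For more information on generation, read the model card of the RoPE-IT2 models.
  - Running these models with flash_attention_2 is recommended for efficient generation.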
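Because flash_attention_2 needs a CUDA GPU and the optional `flash_attn` package, a script that should also run on other machines may want to fall back to a different attention implementation. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and it assumes that the model's remote code accepts `"eager"` as the fallback value for `attn_implementation`, which should be checked against the model card.

```python
import importlib.util


def flash_attn_installed() -> bool:
    """Return True if the optional `flash_attn` package can be found."""
    return importlib.util.find_spec("flash_attn") is not None


def choose_attn_implementation(has_cuda: bool, has_flash_attn: bool) -> str:
    """Pick a value for the `attn_implementation` argument of `from_pretrained`.

    Assumption: "eager" is accepted by the model as the fallback.
    """
    if has_cuda and has_flash_attn:
        return "flash_attention_2"
    return "eager"


# Example wiring (requires torch and transformers, which are not imported here):
#   attn = choose_attn_implementation(torch.cuda.is_available(), flash_attn_installed())
#   model = AutoModelForSeq2SeqLM.from_pretrained(
#       model_name, trust_remote_code=True, attn_implementation=attn
#   )
```

Keeping the decision in a pure function makes it easy to test without a GPU; only the commented wiring depends on the runtime environment.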
📚 Detailed Documentation
Supported Languages
| Property | Details |
|---|---|
| Supported language codes | as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur |
| Language details | asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab |
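Each language tag combines a language code and a script code separated by an underscore, so the same language can appear with more than one script (for example `kas_Arab` and `kas_Deva`). A small validation step before calling the model can catch mistyped tags early. The sketch below copies the tags from the table above; the helper names are hypothetical, and the rule that the source must be `eng_Latn` is an assumption based on this being an English-to-Indic model.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Language tags copied from the "Language details" row of the table above.
SUPPORTED_TAGS = (
    "asm_Beng", "ben_Beng", "brx_Deva", "doi_Deva", "eng_Latn", "gom_Deva",
    "guj_Gujr", "hin_Deva", "kan_Knda", "kas_Arab", "kas_Deva", "mai_Deva",
    "mal_Mlym", "mar_Deva", "mni_Beng", "mni_Mtei", "npi_Deva", "ory_Orya",
    "pan_Guru", "san_Deva", "sat_Olck", "snd_Arab", "snd_Deva", "tam_Taml",
    "tel_Telu", "urd_Arab",
)


def split_tag(tag: str) -> Tuple[str, str]:
    """Split a tag such as "hin_Deva" into its language and script parts."""
    language, script = tag.split("_", 1)
    return language, script


def tags_by_script() -> Dict[str, List[str]]:
    """Group the supported tags by their script code."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for tag in SUPPORTED_TAGS:
        groups[split_tag(tag)[1]].append(tag)
    return dict(groups)


def check_en_indic_pair(src_lang: str, tgt_lang: str) -> None:
    """Raise ValueError unless the pair looks valid for an English-to-Indic model.

    Assumption: the source must be "eng_Latn" and the target must be one of
    the other supported tags.
    """
    if src_lang != "eng_Latn":
        raise ValueError(f"expected source 'eng_Latn', got {src_lang!r}")
    if tgt_lang == "eng_Latn" or tgt_lang not in SUPPORTED_TAGS:
        raise ValueError(f"unsupported target language tag: {tgt_lang!r}")
```

Calling `check_en_indic_pair(src_lang, tgt_lang)` before `ip.preprocess_batch` would reject a tag such as `"hi"` or `"hin_Latn"` with a clear error message.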
Related Tags
- indictrans2
- translation
- ai4bharat
- multilingual
License
This project is released under the MIT License.
Datasets
- flores-200
- IN22-Gen
- IN22-Conv
Evaluation Metrics
Inference Settings
Inference is turned off.
📖 Citation
If you consider using our work, please cite it as follows:
@article{gala2023indictrans,
  title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=vfT4YuzAYA},
  note={}
}