IndicTrans2 Open-Source Multilingual Machine Translation Model - Free Translation Between English and 22 Indian Languages

Indictrans2 En Indic 1B

Developed by ai4bharat

IndicTrans2 is a high-quality multilingual machine translation model that supports translation between English and 22 Indian languages.

Machine Translation

Transformers

Multilingual
License: MIT
#Indic multilingual translation #22 Indian languages #High-accuracy machine translation

Downloads: 106.30k

Released: 9/9/2023

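Model Overview

This model focuses on high-quality machine translation between English and 22 Indian languages. It uses a 1.1B-parameter architecture and supports multiple Indic writing systems.

Model Features

Multilingual support

Supports translation between English and 22 Indian languages, covering multiple writing systems.

High-quality translation

Uses a 1.1B-parameter model to deliver high-quality translations.

Long-context support

The RoPE variants can handle sequences of up to 2048 tokens.

Multiple writing systems

Handles the scripts used to write Indian languages, including Devanagari, Bengali, and Arabic.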

Model Capabilities

English-to-Indic translation

Indic-to-English translation

Multilingual machine translation

Long-text translation

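Use Cases

Cross-lingual communication

Government document translation

Translate government documents between English and regional Indian languages.

Improves the accessibility of government information in multilingual settings.

Educational content localization

Translate educational materials into different Indian languages.

Helps educational resources reach different language communities.

Commercial applications

Multilingual customer support

Translate customer-service content for businesses.

Extends a business's reach in multilingual markets.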

🚀 IndicTrans2

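IndicTrans2 is a machine translation model for English and Indian languages. This checkpoint is the 1.1B-parameter En-Indic variant, which translates from English into multiple Indian languages.

The evaluation metrics for this particular checkpoint are listed below.

For more details on model training, intended use, data, metrics, limitations, and recommendations, see Appendix D: Model Card of the preprint.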

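✨ Key Features

Multilingual support: covers multiple Indian languages, including Assamese, Bengali, and Bodo.
High-quality translation: trained on large-scale datasets to produce high-quality translations.
Long-text handling: the new RoPE-based models can handle sequences of up to 2048 tokens.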

📦 Installation

This page does not list specific installation steps; see the GitHub repository for details.
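
Because no installation steps are given here, it can help to check which packages are missing before running the examples. The sketch below assumes the import names that appear in the usage code (`torch`, `transformers`, `IndicTransToolkit`), plus `flash_attn` for the optional flash-attention path. The helper name `missing_packages` is hypothetical, and the actual install commands should be taken from the GitHub repository.

```python
from importlib.util import find_spec

# Import names taken from the usage examples; flash_attn is optional.
REQUIRED = ["torch", "transformers", "IndicTransToolkit"]
OPTIONAL = ["flash_attn"]


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print("missing required:", missing_packages(REQUIRED))
    print("missing optional:", missing_packages(OPTIONAL))
```

An empty "missing required" list suggests the imports in the examples should resolve; it says nothing about version compatibility.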

💻 Usage Examples

Basic usage

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor
# Recommended: run this on a GPU with flash_attn installed.
# Don't set attn_implementation if you don't have flash_attn.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

src_lang, tgt_lang = "eng_Latn", "hin_Deva"
model_name = "ai4bharat/indictrans2-en-indic-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    torch_dtype=torch.float16, # performance might slightly vary for bfloat16
    attn_implementation="flash_attention_2"
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "When I was young, I used to go to the park every day.",
    "We watched a new movie last week, which was very inspiring.",
    "If you had met me at that time, we would have gone out to eat.",
    "My friend has invited me to his birthday party, and I will give him a gift.",
]

batch = ip.preprocess_batch(
    input_sentences,
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)

# Tokenize the sentences and generate input encodings
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations using the model
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated tokens into text
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Postprocess the translations, including entity replacement
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
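
The comments in the example above advise against setting `attn_implementation` when flash_attn is not available. One way to follow that advice without editing the call by hand is to build the keyword argument conditionally. This is only a sketch: `flash_attn_installed` and `attention_kwargs` are hypothetical helper names, and requiring both CUDA and the `flash_attn` package is an assumption about when the flash-attention path is usable.

```python
from importlib.util import find_spec


def flash_attn_installed() -> bool:
    """True if the flash_attn package can be found (it is not imported here)."""
    return find_spec("flash_attn") is not None


def attention_kwargs(has_cuda: bool, has_flash_attn: bool) -> dict:
    """Return extra from_pretrained kwargs, or an empty dict to use the default attention."""
    if has_cuda and has_flash_attn:
        return {"attn_implementation": "flash_attention_2"}
    return {}


# Intended use with the example above (not executed here):
# model = AutoModelForSeq2SeqLM.from_pretrained(
#     model_name,
#     trust_remote_code=True,
#     torch_dtype=torch.float16,
#     **attention_kwargs(torch.cuda.is_available(), flash_attn_installed()),
# ).to(DEVICE)
```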

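Advanced usage

The new RoPE-based IndicTrans2 models can handle sequences of up to 2048 tokens. To use them, change the model_name parameter. Using flash_attention_2 is recommended for efficient generation.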

# Advanced scenario: long-text translation with a RoPE-based IndicTrans2 model
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor
# Recommended: run this on a GPU with flash_attn installed.
# Don't set attn_implementation if you don't have flash_attn.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

src_lang, tgt_lang = "eng_Latn", "hin_Deva"
# Change model_name to a RoPE-based checkpoint.
# "rope_model_name" is a placeholder, not a real model ID.
model_name = "rope_model_name"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    torch_dtype=torch.float16, # performance might slightly vary for bfloat16
    attn_implementation="flash_attention_2"
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "Long text input here...",
]

batch = ip.preprocess_batch(
    input_sentences,
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)

# Tokenize the sentences and generate input encodings
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations using the model
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=2048,  # allow longer output sequences
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated tokens into text
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Postprocess the translations, including entity replacement
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
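
Even with a 2048-token limit, a long document may not fit in one sequence, and `truncation=True` cuts off anything past the tokenizer's maximum length. A common workaround is to group consecutive sentences into chunks that stay within a token budget before tokenizing. The sketch below illustrates that idea under stated assumptions: `chunk_by_token_budget`, `count_tokens`, and `whitespace_count` are hypothetical names, and the whitespace counter only stands in for the real tokenizer's length.

```python
from typing import Callable, List


def chunk_by_token_budget(
    sentences: List[str],
    count_tokens: Callable[[str], int],
    budget: int = 2048,
) -> List[List[str]]:
    """Greedily group consecutive sentences so each group stays within the budget.

    A single sentence that exceeds the budget is kept as its own group.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for sentence in sentences:
        cost = count_tokens(sentence)
        # Start a new group when adding this sentence would exceed the budget.
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(sentence)
        used += cost
    if current:
        chunks.append(current)
    return chunks


def whitespace_count(text: str) -> int:
    """Stand-in token counter: number of whitespace-separated words."""
    return len(text.split())


if __name__ == "__main__":
    groups = chunk_by_token_budget(["a b", "c d", "e"], whitespace_count, budget=4)
    print([" ".join(group) for group in groups])
```

With the real model, the counter could instead measure the length of the tokenizer's `input_ids` for each sentence, and each group joined with spaces would become one entry of `input_sentences`. A budget somewhat below 2048 leaves room for language tags and special tokens.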

📚 Documentation

See the GitHub repository for a detailed description of how to run inference with the Hugging Face-compatible IndicTrans2 models.

🔧 Technical Details

Supported languages

Attribute	Details
Supported language codes	as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur
Language details	asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab
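
The language tags in the table follow a `language_Script` pattern (a three-letter language code plus a four-letter script code), and the same tags are passed as `src_lang` and `tgt_lang` in the usage examples. A small check before calling the model can catch typos early. The sketch below copies the tags from the table; `split_tag` is a hypothetical helper, not part of IndicTransToolkit.

```python
# Tags copied from the "Language details" row of the table above.
SUPPORTED_TAGS = (
    "asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, "
    "hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, "
    "mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, "
    "snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab"
).split(", ")


def split_tag(tag: str) -> tuple:
    """Validate a tag against the table and return (language, script)."""
    if tag not in SUPPORTED_TAGS:
        raise ValueError(f"unsupported language tag: {tag!r}")
    language, script = tag.split("_")
    return language, script


if __name__ == "__main__":
    print(split_tag("hin_Deva"))
    # Distinct non-English languages across all listed tags.
    indic = {tag.split("_")[0] for tag in SUPPORTED_TAGS} - {"eng"}
    print(len(SUPPORTED_TAGS), "tags,", len(indic), "Indian languages")
```

Some languages appear with more than one script (for example `kas_Arab` and `kas_Deva`), which is why the table has more tags than languages.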

Datasets

flores-200
IN22-Gen
IN22-Conv

Evaluation metrics

bleu
chrf
chrf++
comet
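
Of the metrics listed above, chrF is simple enough to illustrate directly: it is an F-score over character n-grams, averaged across n-gram orders. The following is a simplified, sentence-level sketch for intuition only. It is not the scoring code behind any reported results, `simple_chrf` is a hypothetical name, and established tools such as sacreBLEU handle edge cases (and the word n-grams of chrF++) that this version ignores.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale; beta=2 weights recall above precision."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(simple_chrf("abcd", "abcd"))
    print(simple_chrf("abcd", "abcf"))
```

Character-level matching is one reason chrF is often used for morphologically rich languages, where an almost-correct word form still earns partial credit.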

📄 License

This project is released under the MIT License.

📖 Citation

If you use our work, please cite it as follows:

@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}