indictrans2-indic-indic-1B: open-source translation model with free support for translation among 22 Indian languages

Indictrans2 Indic Indic 1B

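Developed by ai4bharat

A 1B-parameter model for translation among 22 Indian languages, obtained by stitching the Indic-En and En-Indic models together and then adapting the result.

Machine translation

Transformers

License: MIT #Indic multilingual translation #1B-parameter model #22 Indian languages

Downloads: 1,542

Released: 11/28/2023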

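Model overview

The model targets machine translation among India's 22 official languages, including pairs whose source and target use different writing systems.

Model features

Multilingual support

Translates among the 22 official Indian languages, which span several writing systems

Large model size

Uses a 1B-parameter model for higher translation quality

Cross-script translation

Handles input and output in different writing systems, such as Devanagari, Bengali, and Tamil scripts

Model capabilities

Translation among Indian languages

Handling of multiple writing systems

Batch translation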

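Use cases

Cross-language communication

Government document translation

Translate government documents between Indian languages

Makes government information more accessible across language communities

Localization of educational material

Translate educational material into the local languages of different regions

Supports equal access to educational resources

Commercial applications

Multilingual customer support

Provide support content to users of different Indian languages

Improves customer satisfaction and market reach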

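🚀 IndicTrans2

This IndicTrans2 checkpoint is the Indic-to-Indic 1B-parameter variant, obtained by stitching the Indic-En 1B and En-Indic 1B variants together and then adapting the combined model. It translates directly among Indian languages.

✨ Key features

Multilingual: supports many Indian languages, including as, bn, and brx; the corresponding language tags are asm_Beng, ben_Beng, brx_Deva, and so on.
Multi-domain: the model card lists the flores-200, IN22-Gen, and IN22-Conv datasets, so the model is suited to translation in different scenarios.
Multiple evaluation metrics: translation quality is assessed with bleu, chrf, chrf++, and comet.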
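
The language tags above pair a three-letter language code with a four-letter script code (for example, asm_Beng is Assamese written in Bengali script). As a minimal sketch, the snippet below splits such tags and groups the languages by script. `LangTag` and `parse_lang_tag` are hypothetical helper names introduced here for illustration and are not part of IndicTransToolkit; the strict 3+4 length check is an assumption based on the tags that appear in this document.

```python
from typing import Dict, List, NamedTuple


class LangTag(NamedTuple):
    language: str  # three-letter language code, e.g. "hin"
    script: str    # four-letter script code, e.g. "Deva"


def parse_lang_tag(tag: str) -> LangTag:
    """Split a tag such as "hin_Deva" into its language and script parts."""
    language, sep, script = tag.partition("_")
    # Assumption: every tag has the shape xxx_Yyyy, as in the examples here.
    if not sep or len(language) != 3 or len(script) != 4:
        raise ValueError(f"unexpected language tag: {tag!r}")
    return LangTag(language, script)


# Tags that appear in this document.
tags = ["asm_Beng", "ben_Beng", "brx_Deva", "hin_Deva", "tam_Taml"]

# Group language codes by the script they are written in.
by_script: Dict[str, List[str]] = {}
for tag in tags:
    parsed = parse_lang_tag(tag)
    by_script.setdefault(parsed.script, []).append(parsed.language)

print(by_script)
# {'Beng': ['asm', 'ben'], 'Deva': ['brx', 'hin'], 'Taml': ['tam']}
```

Validating a tag before passing it as `src_lang` or `tgt_lang` can catch typos early, before any model weights are loaded.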

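📦 Installation

The source documentation does not give specific installation steps. The usage example imports torch, transformers, and IndicTransToolkit, so those packages need to be available.

💻 Usage example

Basic usage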

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor
# Recommended: run this on a GPU with flash_attn installed.
# Omit attn_implementation below if flash_attn is not installed.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

src_lang, tgt_lang = "hin_Deva", "tam_Taml"
model_name = "ai4bharat/indictrans2-indic-indic-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    torch_dtype=torch.float16, # performance might slightly vary for bfloat16
    attn_implementation="flash_attention_2"
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।",
    "हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।",
    "अगर तुम मुझे उस समय पास मिलते, तो हम बाहर खाना खाने चलते।",
    "मेरे मित्र ने मुझे उसके जन्मदिन की पार्टी में बुलाया है, और मैं उसे एक तोहफा दूंगा।",
]

batch = ip.preprocess_batch(
    input_sentences,
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)

# Tokenize the sentences and generate input encodings
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations using the model
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated tokens into text
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Postprocess the translations, including entity replacement
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
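
The basic usage above sends every input sentence through the model as a single batch. For longer inputs, one option is to split the sentences into fixed-size batches and collect the results in order. The sketch below shows only that batching logic: `chunked` and `translate_in_batches` are hypothetical helper names, and `translate_batch` is a stand-in for the preprocess, tokenize, generate, decode, and postprocess steps shown above. The default batch size of 4 is an arbitrary assumption, not a recommendation from the model authors.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(sentences: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `sentences`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sentences), batch_size):
        yield list(sentences[start:start + batch_size])


def translate_in_batches(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 4,
) -> List[str]:
    """Translate `sentences` batch by batch, preserving the input order."""
    translations: List[str] = []
    for batch in chunked(sentences, batch_size):
        translations.extend(translate_batch(batch))
    return translations


# Stand-in "translator" so the sketch runs without a model: it upper-cases text.
def fake_translate(batch: List[str]) -> List[str]:
    return [sentence.upper() for sentence in batch]


result = translate_in_batches(["a", "b", "c", "d", "e"], fake_translate, batch_size=2)
print(result)  # ['A', 'B', 'C', 'D', 'E']
```

In practice `translate_batch` would wrap the model call, and the batch size would be chosen to fit the available GPU memory.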

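📚 Documentation

See the blog post for more details on model training, data, and evaluation metrics.

For inference with the Hugging Face-compatible IndicTrans2 models, see the GitHub repository.

📄 License

This project is released under the MIT license.

📖 Citation

If you use this work, please cite it as follows: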

@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}

📋 Model information

Property	Details
Model type	IndicTrans2 Indic-to-Indic 1B-parameter variant
Datasets	flores-200, IN22-Gen, IN22-Conv
Evaluation metrics	bleu, chrf, chrf++, comet
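
Of the metrics listed, chrf compares character n-grams between a system translation and a reference and combines precision and recall into an F-score that weights recall more heavily; chrf++ additionally uses word n-grams. The following is a deliberately simplified sketch of the character-level idea only. `simple_chrf` is a hypothetical name, the choices of n up to 6 and beta = 2 are the commonly used defaults and are assumed here, and the scores will not exactly match those of standard evaluation tools or the numbers reported for this model.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of `text`, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    # F-beta: beta > 1 gives recall more weight than precision.
    return 100 * (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


print(simple_chrf("the cat sat", "the cat sat"))  # 100.0
print(simple_chrf("abc", "xyz"))                  # 0.0
```

For reported results, an established implementation of these metrics should be used rather than a sketch like this one.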