IndicTrans2: Open-Source Multilingual Machine Translation Model - Free Translation Between English and 22 Indic Languages

Indictrans2 En Indic 1B

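Developed by ai4bharat

IndicTrans2 is a high-quality multilingual machine translation model that supports translation between English and 22 Indic languages.

Machine Translation

Transformers

Multilingual | Open-source license: MIT | #IndicMultilingualTranslation #22IndicLanguages #HighAccuracyMachineTranslation

Downloads: 106.30k

Released: 9/9/2023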

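Model Overview

The model focuses on high-quality machine translation between English and 22 Indic languages. It uses a 1.1B-parameter architecture and supports multiple scripts used to write Indian languages.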

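Model Features

Multilingual support

Translates between English and 22 Indic languages, covering multiple scripts

High-quality translation

Uses a 1.1B-parameter model to produce high-quality translations

Long-context support

The RoPE variants can process sequences of up to 2048 tokens

Multiple scripts

Handles scripts used for Indian languages, including Devanagari, Bengali, and Arabic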

Model Capabilities

English-to-Indic translation

Indic-to-English translation

Multilingual machine translation

Long-text translation

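Use Cases

Cross-lingual communication

Government document translation

Translate government documents between English and regional Indian languages

Makes government information more accessible in multilingual settings

Educational content localization

Translate educational materials into different Indic languages

Helps educational resources reach different language communities

Business applications

Multilingual customer support

Translate customer-service content for businesses

Extends a business's reach in multilingual markets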

🚀 IndicTrans2

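IndicTrans2 is a model for English-Indic (En-Indic) translation; this page covers its 1.1B-parameter variant. The model supports translation for multiple Indic languages.

Evaluation metrics for this specific checkpoint are listed below.

For more details on model training, intended use, data, metrics, limitations, and recommendations, see Appendix D (Model Card) of the preprint.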

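✨ Key Features

Multilingual support: Supports multiple Indic languages, including Assamese, Bengali, and Bodo.
High-quality translation: Trained on large-scale datasets to produce high-quality translations.
Long-text processing: The newer RoPE-based models can handle sequences of up to 2048 tokens.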

📦 Installation

This page does not give specific installation steps; see the GitHub repository for details.

💻 Usage Examples

Basic usage

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor
# Recommended: run this on a GPU with flash_attn installed.
# Do not set attn_implementation if flash_attn is not installed.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

src_lang, tgt_lang = "eng_Latn", "hin_Deva"
model_name = "ai4bharat/indictrans2-en-indic-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    torch_dtype=torch.float16, # performance might slightly vary for bfloat16
    attn_implementation="flash_attention_2"
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "When I was young, I used to go to the park every day.",
    "We watched a new movie last week, which was very inspiring.",
    "If you had met me at that time, we would have gone out to eat.",
    "My friend has invited me to his birthday party, and I will give him a gift.",
]

batch = ip.preprocess_batch(
    input_sentences,
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)

# Tokenize the sentences and generate input encodings
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations using the model
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated tokens into text
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Postprocess the translations, including entity replacement
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
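
The comments in the example above advise setting `attn_implementation` only when flash_attn is installed. A minimal sketch of that decision follows. The helper names `flash_attn_installed` and `build_model_kwargs` are hypothetical and are not part of Transformers or IndicTransToolkit; the sketch only assembles keyword arguments and does not load a model.

```python
import importlib.util


def flash_attn_installed() -> bool:
    """Return True if the flash_attn package can be located (it is not imported)."""
    return importlib.util.find_spec("flash_attn") is not None


def build_model_kwargs(has_cuda: bool, has_flash_attn: bool) -> dict:
    """Assemble keyword arguments for from_pretrained (hypothetical helper)."""
    kwargs = {"trust_remote_code": True}
    # Request FlashAttention 2 only when a GPU and the package are both available;
    # otherwise leave attn_implementation unset, as the example's comments advise.
    if has_cuda and has_flash_attn:
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


if __name__ == "__main__":
    # has_cuda is hard-coded here; in practice it would come from
    # torch.cuda.is_available(), as in the example above.
    print(build_model_kwargs(has_cuda=False, has_flash_attn=flash_attn_installed()))
```

The resulting dictionary could then be unpacked into `AutoModelForSeq2SeqLM.from_pretrained(model_name, **kwargs)`, with `torch_dtype` added by the caller as in the example.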

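Advanced usage

The newer RoPE-based IndicTrans2 models can handle sequences of up to 2048 tokens. To use them, change the model_name parameter. Using flash_attention_2 is recommended for efficient generation.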

# Advanced scenario: long-text translation with a RoPE-based IndicTrans2 model
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor
# Recommended: run this on a GPU with flash_attn installed.
# Do not set attn_implementation if flash_attn is not installed.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

src_lang, tgt_lang = "eng_Latn", "hin_Deva"
# Set model_name to a RoPE-based checkpoint ("rope_model_name" is a placeholder)
model_name = "rope_model_name"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    torch_dtype=torch.float16, # performance might slightly vary for bfloat16
    attn_implementation="flash_attention_2"
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "Long text input here...",
]

batch = ip.preprocess_batch(
    input_sentences,
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)

# Tokenize the sentences and generate input encodings
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations using the model
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=2048,  # allows longer sequences
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated tokens into text
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Postprocess the translations, including entity replacement
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
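
Even with a 2048-token limit, some documents will not fit in a single sequence, and the tokenizer call above uses `truncation=True`, which cuts off anything over the limit. One possible approach, sketched here under stated assumptions, is to greedily pack consecutive sentences into groups that stay within a token budget and translate each group separately. `pack_sentences` and `whitespace_count` are hypothetical names; the whitespace counter is only a stand-in for a real tokenizer-based count, and the budget ignores any special tokens or language tags added during preprocessing.

```python
from typing import Callable, List


def pack_sentences(
    sentences: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 2048,
) -> List[List[str]]:
    """Greedily group consecutive sentences so each group stays within max_tokens."""
    groups: List[List[str]] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new group when adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            groups.append(current)
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        groups.append(current)
    return groups


def whitespace_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


if __name__ == "__main__":
    demo = ["a b c", "d e", "f g h i", "j"]
    print(pack_sentences(demo, whitespace_count, max_tokens=5))
```

A sentence that exceeds the budget on its own still ends up alone in a group and would be truncated by the tokenizer, so very long sentences need separate handling.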

📚 Detailed Documentation

See the GitHub repository for a detailed description of how to run inference with the Hugging Face-compatible IndicTrans2 models.

🔧 Technical Details

Supported languages

Attribute	Details
Supported language codes	as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur
Language details	asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab
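
The tags in the second row are the values passed as `src_lang` and `tgt_lang` in the usage examples. Each tag appears to combine a three-letter language code with a four-letter script code, joined by an underscore. The sketch below validates a tag against the list from the table before it is used; `split_tag` and `tags_for_script` are hypothetical helpers, not part of IndicTransToolkit.

```python
from typing import List, Tuple

# Tags copied from the second row of the table above.
SUPPORTED_TAGS = (
    "asm_Beng ben_Beng brx_Deva doi_Deva eng_Latn gom_Deva guj_Gujr hin_Deva "
    "kan_Knda kas_Arab kas_Deva mai_Deva mal_Mlym mar_Deva mni_Beng mni_Mtei "
    "npi_Deva ory_Orya pan_Guru san_Deva sat_Olck snd_Arab snd_Deva tam_Taml "
    "tel_Telu urd_Arab"
).split()


def split_tag(tag: str) -> Tuple[str, str]:
    """Split a tag such as 'hin_Deva' into (language, script); raise if unsupported."""
    if tag not in SUPPORTED_TAGS:
        raise ValueError(f"Unsupported language tag: {tag!r}")
    language, script = tag.split("_")
    return language, script


def tags_for_script(script: str) -> List[str]:
    """Return all supported tags written in the given script, in table order."""
    return [t for t in SUPPORTED_TAGS if t.endswith("_" + script)]


if __name__ == "__main__":
    print(split_tag("hin_Deva"))
    print(tags_for_script("Arab"))
```

Checking tags up front surfaces typos such as `hin_deva` before any model call is made.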

Datasets

flores-200
IN22-Gen
IN22-Conv

Evaluation metrics

bleu
chrf
chrf++
comet
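
To give a feel for what a character-level metric such as chrF measures, here is a simplified sketch: it averages character n-gram precision and recall over orders 1 to 6 and combines them into an F-score with beta = 2. This is an illustration only, not the implementation used to evaluate this checkpoint. Stripping spaces and skipping orders that are too long for either string are assumptions of this sketch, and its scores should not be compared with published numbers. chrF++ additionally includes word n-grams, which are omitted here.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n, ignoring spaces (assumption of this sketch)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score in [0, 1] for one hypothesis/reference pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders that one side is too short to form
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: with beta = 2, recall is weighted more heavily than precision.
    return (1 + beta**2) * p * r / (beta**2 * p + r)


if __name__ == "__main__":
    print(simple_chrf("namaste duniya", "namaste duniya"))
    print(simple_chrf("abcd", "abce"))
```

For real evaluations, an established implementation is the safer choice.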

📄 License

This project is released under the MIT License.

📖 Citation

If you use our work, please cite it as follows:

@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}