indictrans2-indic-indic-1B: an open-source translation model for direct translation among 22 Indian languages

Indictrans2 Indic Indic 1B

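Developed by ai4bharat

A 1B-parameter model for translation among 22 Indian languages, obtained by stitching the Indic-En and En-Indic models together and then adapting the result.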

Machine Translation

Transformers

License: MIT. Tags: Indic-to-Indic translation, 1B parameters, 22 Indian languages

Downloads: 1,542

Released: November 28, 2023

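Model Overview

The model targets high-quality machine translation among the 22 scheduled languages of India, including pairs of languages that are written in different scripts.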

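Model Features

Multilingual support

Translates between the 22 scheduled Indian languages, which are written in multiple scripts.

Large model size

Uses 1B parameters, aiming for higher translation quality than smaller models.

Multiple scripts

Handles translation between languages written in different scripts, such as Devanagari, Bengali, and Tamil.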

Model Capabilities

Translation between Indian languages

Multi-script text handling

Batch translation

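Use Cases

Cross-lingual communication

Government document translation

Translate government documents between Indian languages

Makes government information more accessible across language communities

Localization of educational materials

Translate educational materials into regional languages

Supports more equal access to educational resources

Commercial applications

Multilingual customer support

Provide support content to users of different Indian languages

Improves customer satisfaction and market reach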

🚀 IndicTrans2

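IndicTrans2 Indic-Indic 1B is a 1-billion-parameter Indic-to-Indic model, adapted after stitching together the Indic-En 1B and En-Indic 1B variants. It translates directly between Indian languages and is aimed at high-quality output across them.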

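✨ Key Features

Multilingual support: covers many Indian languages, including as, bn, and brx. In language-tag form these are asm_Beng, ben_Beng, brx_Deva, and so on.
Multi-domain coverage: the model card lists the flores-200, IN22-Gen, and IN22-Conv datasets, which cover different translation scenarios.
Multiple evaluation metrics: translation quality is measured with bleu, chrf, chrf++, and comet.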
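
The language tags above appear to follow the FLORES-style pattern of a three-letter language code joined by an underscore to a four-letter script code (for example, hin_Deva for Hindi written in Devanagari). A minimal sketch of splitting and validating such a tag follows; `parse_lang_tag` and `LangTag` are hypothetical names used for illustration and are not part of IndicTransToolkit.

```python
import re
from typing import NamedTuple


class LangTag(NamedTuple):
    language: str  # three-letter language code, e.g. "hin"
    script: str    # four-letter script code, e.g. "Deva"


# Assumed tag shape: three lowercase letters, underscore, capitalized four-letter script.
_TAG_RE = re.compile(r"^([a-z]{3})_([A-Z][a-z]{3})$")


def parse_lang_tag(tag: str) -> LangTag:
    """Split a tag such as 'hin_Deva' into its language and script parts."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a language_Script tag: {tag!r}")
    return LangTag(match.group(1), match.group(2))


for tag in ["asm_Beng", "ben_Beng", "brx_Deva"]:
    parsed = parse_lang_tag(tag)
    print(tag, "->", parsed.language, parsed.script)
```

A check like this only validates the shape of a tag. Whether the model supports a given tag still has to be confirmed against the model's own language list.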

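📦 Installation

The source documentation does not give installation steps. The usage example below imports torch, transformers, and IndicTransToolkit, so those packages need to be available.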
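
Since no installation steps are given, one way to see what is missing before running the usage example is to check whether the modules it imports can be found. This is a minimal sketch: the required names are taken from the example's import statements, flash_attn is treated as optional because the example's comments say so, and `missing_packages` is a hypothetical helper.

```python
from importlib.util import find_spec

# Top-level modules imported by the usage example.
REQUIRED = ["torch", "transformers", "IndicTransToolkit"]
# Only needed when attn_implementation="flash_attention_2" is passed.
OPTIONAL = ["flash_attn"]


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


print("missing required modules:", missing_packages(REQUIRED))
print("missing optional modules:", missing_packages(OPTIONAL))
```

If flash_attn is reported as missing, the example's own comment suggests leaving out the attn_implementation argument when loading the model.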

💻 Usage Example

Basic usage

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor
# Recommended: run this on a GPU with flash_attn installed.
# Omit attn_implementation if flash_attn is not installed.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

src_lang, tgt_lang = "hin_Deva", "tam_Taml"
model_name = "ai4bharat/indictrans2-indic-indic-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    torch_dtype=torch.float16, # performance might slightly vary for bfloat16
    attn_implementation="flash_attention_2"
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।",
    "हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।",
    "अगर तुम मुझे उस समय पास मिलते, तो हम बाहर खाना खाने चलते।",
    "मेरे मित्र ने मुझे उसके जन्मदिन की पार्टी में बुलाया है, और मैं उसे एक तोहफा दूंगा।",
]

batch = ip.preprocess_batch(
    input_sentences,
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)

# Tokenize the sentences and generate input encodings
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations using the model
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated tokens into text
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Postprocess the translations, including entity replacement
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
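
The example above translates four sentences in a single call. For longer inputs, one option is to split the sentence list into fixed-size chunks and run the same steps (preprocess, tokenize, generate, decode, postprocess) on each chunk. The sketch below shows only the chunking logic. `chunked` and `translate_in_batches` are hypothetical helpers, `translate_batch` stands for a caller-supplied function wrapping the steps above, and the default batch size of 4 is an arbitrary assumption.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_in_batches(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 4,
) -> List[str]:
    """Apply `translate_batch` chunk by chunk, keeping the input order."""
    translations: List[str] = []
    for batch in chunked(sentences, batch_size):
        translations.extend(translate_batch(batch))
    return translations


# Stand-in "translator" that upper-cases its input, used only to show the call shape.
print(translate_in_batches(["a", "b", "c"], lambda batch: [s.upper() for s in batch], batch_size=2))
```

Smaller batches reduce peak memory at the cost of more generate calls. A suitable size depends on the hardware and on sentence length.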

📚 Documentation

See the blog for more details on model training, data, and evaluation metrics.

For inference with the Hugging Face-compatible IndicTrans2 models, see the GitHub repository.

📄 License

This project is released under the MIT license.

📖 Citation

If you use this work, please cite:

@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}

📋 Model Information

Property	Details
Model type	IndicTrans2 Indic-Indic 1B variant
Datasets (as listed on the model card)	flores-200, IN22-Gen, IN22-Conv
Evaluation metrics	bleu, chrf, chrf++, comet
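
Of the metrics listed, chrf is simple enough to sketch by hand: it is an F-score over character n-grams. The following is a simplified sentence-level illustration, not the implementation used to evaluate IndicTrans2. It assumes n-gram orders 1 to 6 and beta = 2, ignores whitespace, and averages precision and recall over the orders. Reported scores are normally computed with a standard tool such as sacreBLEU, whose numbers may differ from this sketch.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace.

    Characters here are Unicode code points, not grapheme clusters.
    """
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale; beta > 1 weights recall above precision."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return 100 * (1 + b2) * p * r / (b2 * p + r)


print(simple_chrf("the cat sat", "the cat sat"))
print(simple_chrf("the cat sat", "a dog ran"))
```

An exact match scores 100 and strings with no shared characters score 0. Everything in between reflects partial character-level overlap, which is why chrf tends to be more forgiving of morphological variation than word-level metrics.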