InLegalTrans-En2Indic-1B: a free, open-source legal translation model that translates English Indian legal text into nine Indian languages

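Developed by law-ai

InLegalTrans is a legal-text translation model fine-tuned from IndicTrans2. It is built specifically to translate Indian legal text from English into nine Indian languages.

Machine Translation

Safetensors

Supports multiple languages. License: MIT. #LegalTextTranslation #MultilingualSupport #IndianLanguageOptimization

Downloads: 81

Released: 1/19/2025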

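Model Overview

The model focuses on English-to-Indic translation of legal text and supports nine Indian languages, including Bengali, Hindi, and Marathi. Fine-tuning on the MILPaC dataset substantially improves its performance over the base model.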

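Model Features

Legal-domain specialization

Optimized specifically for legal text; it outperforms general-purpose translation models on Indian legal translation tasks.

Multilingual support

Translates from English into nine Indian languages, covering the major languages of India.

High performance

Substantially outperforms the base model, IndicTrans2, on BLEU, GLEU, and chrF++.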

Model Capabilities

English-to-Indic translation

Legal text translation

Multilingual machine translation

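Use Cases

Legal document translation

Statute translation

Translate English statutory provisions into regional Indian languages.

Translation quality is substantially better than that of general-purpose translation models.

Court document translation

Translate court judgments and other legal documents.

Preserves the professional register and accuracy of the legal text.

Legal information services

Multilingual legal information

Provide legal information services to speakers of different languages.

Improves the accessibility of legal information.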

🚀 InLegalTrans

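This is the model card for the InLegalTrans-En2Indic-1B translation model. It is a fine-tuned version of the IndicTrans2 model, specialized in translating Indian legal text from English into Indian languages.

🚀 Quick Start

This section covers the model's basic information, how to use it, and the data it was trained and evaluated on.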

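✨ Key Features

Targeted fine-tuning: fine-tuned from the IndicTrans2 model and tailored to English-to-Indic translation of legal text.
Multilingual support: supports nine Indian languages: Bengali (BN), Hindi (HI), Marathi (MR), Tamil (TA), Telugu (TE), Malayalam (ML), Punjabi (PA), Gujarati (GU), and Oriya (OR).
Improved performance: on the MILPaC test set, it scores higher than IndicTrans2 on every evaluation metric, for every language pair.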

📦 Installation

Before using the model, install the libraries it depends on. The imports below show which packages are required:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # Install IndicTransToolkit from https://github.com/VarunGumma/IndicTransToolkit
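Before loading the model, it can help to confirm that the three packages named in the imports above are actually installed. The following is a minimal sketch using only the Python standard library; the helper name `missing_modules` is hypothetical and not part of any of these libraries.

```python
from importlib.util import find_spec

# Top-level module names taken from the import statements above.
REQUIRED_MODULES = ["torch", "transformers", "IndicTransToolkit"]


def missing_modules(modules):
    """Return the names in `modules` that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Any name reported as missing needs to be installed before the usage example will run.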

💻 Usage Examples

Basic usage

The following example uses the InLegalTrans model to translate from English to Bengali:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # Install IndicTransToolkit from https://github.com/VarunGumma/IndicTransToolkit

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
src_lang, tgt_lang = "eng_Latn", "ben_Beng"  # BCP-47 language codes as used by the FLORES-200 dataset
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-1B", trust_remote_code=True)  # Load the IndicTrans2 tokenizer so that its custom tokenization script is used
model = AutoModelForSeq2SeqLM.from_pretrained(
    "law-ai/InLegalTrans-En2Indic-1B",
    trust_remote_code=True,
    attn_implementation="eager",
    low_cpu_mem_usage=True,
).to(device)
ip = IndicProcessor(inference=True)

input_sentences = [
    "(7) Any such allowance for the maintenance and expenses for proceeding shall be payable from the date of the order, or, if so ordered, from the date of the application for maintenance or expenses of proceeding, as the case may be.",
    "(2) Where it appears to the Tribunal that, in consequence of any decision of a competent Civil Court, any order made under section 9 should be cancelled or varied, it shall cancel the order or, as the case may be, vary the same accordingly.",
]

batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)

input_text_encoding = tokenizer(
    batch,
    max_length=256,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(device)

generated_tokens = model.generate(
    **input_text_encoding,
    max_length=256,
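    # With num_beams > 1, do_sample=True samples during beam search, so output may differ
    # between runs; set do_sample=False for deterministic beam search.
    do_sample=True,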
    num_beams=4,
    num_return_sequences=1,
    early_stopping=False,
    use_cache=True,
)

with tokenizer.as_target_tokenizer():
    generated_tokens = tokenizer.batch_decode(
        generated_tokens.detach().cpu().tolist(),
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )

translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"Sentence in {src_lang} language: {input_sentence}") 
    print(f"Translated Sentence in {tgt_lang} language: {translation}")
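To translate into one of the other eight languages, only `tgt_lang` in the example above needs to change. The lookup below is a small sketch of how the two-letter abbreviations used in this card could be mapped to language codes. Only `ben_Beng` appears in the original example; the other eight codes are assumptions based on the FLORES-200 naming scheme and should be checked against the IndicTrans2 documentation. The names `TARGET_LANG_CODES` and `resolve_target_lang` are hypothetical.

```python
# Abbreviation -> FLORES-200 style code for the nine target languages in this card.
# "ben_Beng" comes from the example above; the remaining codes are assumptions.
TARGET_LANG_CODES = {
    "BN": "ben_Beng",  # Bengali
    "HI": "hin_Deva",  # Hindi
    "MR": "mar_Deva",  # Marathi
    "TA": "tam_Taml",  # Tamil
    "TE": "tel_Telu",  # Telugu
    "ML": "mal_Mlym",  # Malayalam
    "PA": "pan_Guru",  # Punjabi
    "GU": "guj_Gujr",  # Gujarati
    "OR": "ory_Orya",  # Oriya
}


def resolve_target_lang(abbreviation):
    """Return the language code for a two-letter abbreviation such as 'HI'."""
    code = TARGET_LANG_CODES.get(abbreviation.upper())
    if code is None:
        raise ValueError(f"Unsupported target language: {abbreviation!r}")
    return code
```

For instance, `tgt_lang = resolve_target_lang("HI")` would select Hindi, after which the preprocessing, generation, and postprocessing steps run unchanged.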

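📚 Documentation

Training data

We fine-tuned the model on the MILPaC (Multilingual Indian Legal Parallel Corpus) corpus. It is the first high-quality Indian legal parallel corpus, and it contains parallel aligned text units in English (EN) and nine Indian languages (IN): Bengali (BN), Hindi (HI), Marathi (MR), Tamil (TA), Telugu (TE), Malayalam (ML), Punjabi (PA), Gujarati (GU), and Oriya (OR). For more details about the corpus, see the paper cited at the end of this card.

For fine-tuning, we randomly split MILPaC language-wise in an 80 (training) - 10 (validation) - 10 (test) ratio. We fine-tune the IndicTrans2 model on the 80% training split (the 80% portions of all English-to-Indic pairs, combined) and use the 10% validation split (the 10% portions of all English-to-Indic pairs, combined) to select the best checkpoint and guard against overfitting.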
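The language-wise 80-10-10 split described above can be sketched as follows. This is an illustration only, not the authors' actual preprocessing script: the function names, the fixed seed, and the corpus layout (a dict mapping a language abbreviation to a list of sentence pairs) are all assumptions.

```python
import random


def split_80_10_10(pairs, seed=0):
    """Shuffle one language pair's examples and cut them into train/validation/test."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = (8 * len(shuffled)) // 10  # 80% for training
    n_val = len(shuffled) // 10          # 10% for validation, remainder for test
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def split_by_language(corpus, seed=0):
    """Apply the split separately to each language, then pool the parts."""
    pooled = {"train": [], "val": [], "test": []}
    per_language = {}
    for lang, pairs in corpus.items():
        train, val, test = split_80_10_10(pairs, seed)
        per_language[lang] = (train, val, test)
        pooled["train"].extend(train)
        pooled["val"].extend(val)
        pooled["test"].extend(test)
    return per_language, pooled
```

Splitting each language separately before pooling keeps the 80-10-10 ratio intact for every language pair, even when the pairs differ greatly in size.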

Model overview and usage notes

The InLegalTrans model uses the same tokenizer as the IndicTrans2 model and has the same architecture, with about 1.12 billion parameters.

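Fine-tuning results

The table below compares the InLegalTrans and IndicTrans2 models on the 10% MILPaC test split, using BLEU, GLEU, and chrF++. InLegalTrans scores higher than IndicTrans2 on every metric for every English-to-Indic pair. The BLEU gain ranges from 7.2 points (English to Tamil) to 20.6 points (English to Telugu).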

English to Indic	Model	BLEU	GLEU	chrF++
English to Bengali	IndicTrans2	25.4	28.8	53.7
	InLegalTrans	45.8	47.6	70.9
English to Hindi	IndicTrans2	41.0	42.5	59.9
	InLegalTrans	56.9	57.1	73.8
English to Marathi	IndicTrans2	25.2	28.7	55.4
	InLegalTrans	44.4	46.0	68.9
English to Tamil	IndicTrans2	32.8	35.3	62.3
	InLegalTrans	40.0	42.5	69.9
English to Telugu	IndicTrans2	10.7	14.2	37.9
	InLegalTrans	31.3	31.6	58.5
English to Malayalam	IndicTrans2	21.9	25.8	52.9
	InLegalTrans	37.4	40.3	69.7
English to Punjabi	IndicTrans2	27.8	31.6	51.5
	InLegalTrans	44.3	45.6	65.5
English to Gujarati	IndicTrans2	27.5	31.1	55.7
	InLegalTrans	42.8	45.2	68.8
English to Oriya	IndicTrans2	6.6	12.6	37.1
	InLegalTrans	14.2	19.9	47.5
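To read the table as improvements rather than raw scores, the per-language gains can be computed directly. The sketch below copies the scores from the table above into a dictionary and subtracts the IndicTrans2 score from the InLegalTrans score for each metric. The names `SCORES` and `absolute_gains` are hypothetical, and the English language names are used here only as dictionary keys.

```python
# (IndicTrans2, InLegalTrans) scores copied from the table above.
SCORES = {
    "Bengali":   {"BLEU": (25.4, 45.8), "GLEU": (28.8, 47.6), "chrF++": (53.7, 70.9)},
    "Hindi":     {"BLEU": (41.0, 56.9), "GLEU": (42.5, 57.1), "chrF++": (59.9, 73.8)},
    "Marathi":   {"BLEU": (25.2, 44.4), "GLEU": (28.7, 46.0), "chrF++": (55.4, 68.9)},
    "Tamil":     {"BLEU": (32.8, 40.0), "GLEU": (35.3, 42.5), "chrF++": (62.3, 69.9)},
    "Telugu":    {"BLEU": (10.7, 31.3), "GLEU": (14.2, 31.6), "chrF++": (37.9, 58.5)},
    "Malayalam": {"BLEU": (21.9, 37.4), "GLEU": (25.8, 40.3), "chrF++": (52.9, 69.7)},
    "Punjabi":   {"BLEU": (27.8, 44.3), "GLEU": (31.6, 45.6), "chrF++": (51.5, 65.5)},
    "Gujarati":  {"BLEU": (27.5, 42.8), "GLEU": (31.1, 45.2), "chrF++": (55.7, 68.8)},
    "Oriya":     {"BLEU": (6.6, 14.2),  "GLEU": (12.6, 19.9), "chrF++": (37.1, 47.5)},
}


def absolute_gains(scores):
    """Return InLegalTrans minus IndicTrans2 per language and metric, rounded to 0.1."""
    return {
        lang: {
            metric: round(finetuned - base, 1)
            for metric, (base, finetuned) in metrics.items()
        }
        for lang, metrics in scores.items()
    }


if __name__ == "__main__":
    for lang, gains in absolute_gains(SCORES).items():
        print(lang, gains)
```

Rounding to one decimal place matches the precision of the table and avoids floating-point noise in the differences.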

Citation

If you use the InLegalTrans translation model or the MILPaC corpus, please cite the following paper:

@article{mahapatra2024milpacnovelbenchmarkevaluating,
      title = {MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages}, 
      author = {Sayan Mahapatra and Debtanu Datta and Shubham Soni and Adrijit Goswami and Saptarshi Ghosh},
      year = {2024},
      journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
      publisher = {Association for Computing Machinery},
}