InLegalTrans-En2Indic-1B Open-Source Legal Translation Model - Translate English Indian Legal Text into Nine Indian Languages for Free


Developed by law-ai

InLegalTrans is a legal-text translation model fine-tuned from IndicTrans2, built to translate Indian legal text from English into nine Indian languages.

Machine Translation

Safetensors

Multilingual support

License: MIT

Tags: #LegalTextTranslation #MultilingualSupport #IndicLanguageOptimization

Downloads: 81

Released: 1/19/2025

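Model Overview

The model focuses on translating legal text from English into Indian languages. It supports nine Indian languages, including Bengali, Hindi, and Marathi, and its performance improves substantially after fine-tuning on the MILPaC dataset.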

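Model Features

Legal-domain specialization

Optimized specifically for legal text; outperforms general-purpose translation models on Indian legal text translation.

Multilingual support

Translates from English into nine Indian languages, covering the major languages of India.

High performance

Substantially outperforms the base model IndicTrans2 on BLEU, GLEU, and chrF++.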

Model Capabilities

English-to-Indic translation

Legal text translation

Multilingual machine translation

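Use Cases

Legal document translation

Translating statutory provisions

Translates English statutory provisions into regional Indian languages.

Translation quality is substantially better than that of general-purpose translation models.

Translating court documents

Translates court judgments and other legal documents.

Preserves the technical register and accuracy of legal text.

Legal information services

Multilingual legal information

Provides legal information to speakers of different languages.

Improves the accessibility of legal information.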

🚀 InLegalTrans

This is the model card for the InLegalTrans-En2Indic-1B translation model, a fine-tuned version of IndicTrans2 built to translate Indian legal text from English into Indian languages.

🚀 Quick Start

This section covers the model's basic information, how to use it, and the data it was trained on.

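✨ Key Features

Targeted fine-tuning: fine-tuned from IndicTrans2 specifically for English-to-Indic translation of legal text.
Multilingual support: covers nine Indian languages, namely Bengali (BN), Hindi (HI), Marathi (MR), Tamil (TA), Telugu (TE), Malayalam (ML), Punjabi (PA), Gujarati (GU), and Odia (OR).
Performance gains: on the MILPaC test set, every evaluation metric improves substantially over IndicTrans2.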
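The two-letter abbreviations above correspond to FLORES-200 style language codes of the kind passed as `tgt_lang` in the usage example (`ben_Beng` for Bengali). The lookup below is a minimal sketch: only `ben_Beng` appears in this card, so the other eight codes are assumptions based on the FLORES-200 naming convention and should be checked against the IndicTrans2 documentation. The names `TARGET_LANG_CODES` and `target_code` are hypothetical.

```python
# Abbreviations used in this card mapped to FLORES-200 style codes.
# Only "ben_Beng" is confirmed by the card; the other codes are assumed.
TARGET_LANG_CODES = {
    "BN": "ben_Beng",  # Bengali
    "HI": "hin_Deva",  # Hindi
    "MR": "mar_Deva",  # Marathi
    "TA": "tam_Taml",  # Tamil
    "TE": "tel_Telu",  # Telugu
    "ML": "mal_Mlym",  # Malayalam
    "PA": "pan_Guru",  # Punjabi
    "GU": "guj_Gujr",  # Gujarati
    "OR": "ory_Orya",  # Odia
}


def target_code(abbrev: str) -> str:
    """Return the assumed FLORES-200 style code for a two-letter abbreviation."""
    key = abbrev.strip().upper()
    if key not in TARGET_LANG_CODES:
        raise ValueError(f"Unsupported target language: {abbrev!r}")
    return TARGET_LANG_CODES[key]
```

With this helper, `target_code("hi")` could replace the hard-coded `tgt_lang` value when translating into Hindi instead of Bengali.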

📦 Installation

Install the required libraries before using the model. The imports below show which packages are needed:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # Install IndicTransToolkit from https://github.com/VarunGumma/IndicTransToolkit
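Before loading the model, it can help to confirm that the three packages are importable. The following is a small sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the module names are taken from the imports shown in this section.

```python
import importlib.util

# Top-level module names taken from the imports in this card.
REQUIRED_MODULES = ("torch", "transformers", "IndicTransToolkit")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules()` returns an empty list when everything is installed; any names it returns still need to be installed before running the usage example.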

💻 Usage Example

Basic usage

The following example uses InLegalTrans to translate from English to Bengali:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # Install IndicTransToolkit from https://github.com/VarunGumma/IndicTransToolkit

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
src_lang, tgt_lang = "eng_Latn", "ben_Beng"  # BCP-47 language codes as used in the FLORES-200 dataset
# Use the IndicTrans2 tokenizer so that its custom tokenization script is enabled
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-1B", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "law-ai/InLegalTrans-En2Indic-1B",
    trust_remote_code=True,
    attn_implementation="eager",
    low_cpu_mem_usage=True,
).to(device)
ip = IndicProcessor(inference=True)

input_sentences = [
    "(7) Any such allowance for the maintenance and expenses for proceeding shall be payable from the date of the order, or, if so ordered, from the date of the application for maintenance or expenses of proceeding, as the case may be.",
    "(2) Where it appears to the Tribunal that, in consequence of any decision of a competent Civil Court, any order made under section 9 should be cancelled or varied, it shall cancel the order or, as the case may be, vary the same accordingly.",
]

# Preprocess the sentences for the chosen source/target language pair
batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)

# Tokenize the batch and move the tensors to the model's device
input_text_encoding = tokenizer(
    batch,
    max_length=256,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(device)

# Generate translations. With do_sample=True and num_beams > 1, sampling is
# applied during beam search, so the output can vary between runs.
generated_tokens = model.generate(
    **input_text_encoding,
    max_length=256,
    do_sample=True,
    num_beams=4,
    num_return_sequences=1,
    early_stopping=False,
    use_cache=True,
)

# Decode the generated token IDs into text
with tokenizer.as_target_tokenizer():
    generated_tokens = tokenizer.batch_decode(
        generated_tokens.detach().cpu().tolist(),
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )

# Postprocess the decoded text for the target language
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"Sentence in {src_lang} language: {input_sentence}")
    print(f"Translated Sentence in {tgt_lang} language: {translation}")
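The example above translates two sentences in a single batch. For a longer document, one option is to split the sentences into fixed-size batches and run the preprocess, tokenize, generate, and decode steps once per batch. The helper below is a sketch of that splitting step only; the name `iter_batches` and the default batch size of 8 are assumptions, not part of the model card.

```python
from typing import Iterator, List, Sequence


def iter_batches(sentences: Sequence[str], batch_size: int = 8) -> Iterator[List[str]]:
    """Yield consecutive slices of `sentences`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sentences), batch_size):
        yield list(sentences[start:start + batch_size])
```

Each yielded list can be passed to `ip.preprocess_batch` in place of `input_sentences`. A smaller batch size lowers peak memory use at the cost of more `generate` calls.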

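📚 Documentation

Training Data

We use the MILPaC (Multilingual Indian Legal Parallel Corpus) corpus for fine-tuning. It is the first high-quality Indian legal parallel corpus, containing parallel aligned text units in English (EN) and nine Indian languages (IN): Bengali (BN), Hindi (HI), Marathi (MR), Tamil (TA), Telugu (TE), Malayalam (ML), Punjabi (PA), Gujarati (GU), and Odia (OR). For more details about the corpus, refer to the paper.

For fine-tuning, we randomly split MILPaC by language into 80% (training), 10% (validation), and 10% (test). We fine-tune IndicTrans2 on the training split (the 80% portions of all English-to-Indic pairs combined) and use the validation split (the 10% portions of all English-to-Indic pairs combined) to select the best checkpoint and guard against overfitting.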
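The per-language 80-10-10 split described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' actual script: the function names, the fixed seed, and the data layout (a dict from language code to a list of sentence pairs) are all hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (English text, Indic text)


def split_80_10_10(pairs: Sequence, seed: int = 0) -> Dict[str, list]:
    """Shuffle one language's pairs and split them 80/10/10."""
    items = list(pairs)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10  # integer arithmetic avoids float rounding
    n_val = len(items) // 10
    return {
        "train": items[:n_train],
        "val": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],
    }


def split_by_language(corpus: Dict[str, List[Pair]], seed: int = 0) -> Dict[str, Dict[str, list]]:
    """Apply the 80/10/10 split independently to each language pair."""
    return {lang: split_80_10_10(pairs, seed) for lang, pairs in corpus.items()}
```

The combined training set is then the union of the `"train"` portions across all nine languages, and likewise for validation.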

Model Overview and Usage Notes

InLegalTrans uses the same tokenizer and the same architecture as IndicTrans2, with approximately 1.12 billion parameters.
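The parameter count gives a rough lower bound on the memory needed to hold the weights. The arithmetic below is a back-of-the-envelope sketch: it counts weights only (no activations, KV cache, or framework overhead), uses decimal gigabytes, and the helper name `weight_memory_gb` is hypothetical.

```python
PARAM_COUNT = 1.12e9  # approximate figure quoted in this card

# Bytes used to store one parameter in common floating-point formats.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(dtype: str, n_params: float = PARAM_COUNT) -> float:
    """Estimate weight storage in decimal GB for the given dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9
```

Under these assumptions the weights take about 4.48 GB in `float32` and about 2.24 GB in `float16`.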

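Fine-Tuning Results

The table below compares InLegalTrans with IndicTrans2 on the 10% MILPaC test split, evaluated with BLEU, GLEU, and chrF++. For every English-to-Indic pair, InLegalTrans improves substantially over IndicTrans2 and scores higher on all three metrics.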

English-to-Indic pair	Model	BLEU	GLEU	chrF++
English to Bengali	IndicTrans2	25.4	28.8	53.7
	InLegalTrans	45.8	47.6	70.9
English to Hindi	IndicTrans2	41.0	42.5	59.9
	InLegalTrans	56.9	57.1	73.8
English to Marathi	IndicTrans2	25.2	28.7	55.4
	InLegalTrans	44.4	46.0	68.9
English to Tamil	IndicTrans2	32.8	35.3	62.3
	InLegalTrans	40.0	42.5	69.9
English to Telugu	IndicTrans2	10.7	14.2	37.9
	InLegalTrans	31.3	31.6	58.5
English to Malayalam	IndicTrans2	21.9	25.8	52.9
	InLegalTrans	37.4	40.3	69.7
English to Punjabi	IndicTrans2	27.8	31.6	51.5
	InLegalTrans	44.3	45.6	65.5
English to Gujarati	IndicTrans2	27.5	31.1	55.7
	InLegalTrans	42.8	45.2	68.8
English to Odia	IndicTrans2	6.6	12.6	37.1
	InLegalTrans	14.2	19.9	47.5
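
To make the size of the improvement explicit, the per-language gains can be computed directly from the table. The sketch below copies the scores from the table as (BLEU, GLEU, chrF++) tuples; the names `SCORES`, `METRICS`, and `gains` are hypothetical.

```python
METRICS = ("BLEU", "GLEU", "chrF++")

# (BLEU, GLEU, chrF++) per model, copied from the results table.
SCORES = {
    "Bengali": {"IndicTrans2": (25.4, 28.8, 53.7), "InLegalTrans": (45.8, 47.6, 70.9)},
    "Hindi": {"IndicTrans2": (41.0, 42.5, 59.9), "InLegalTrans": (56.9, 57.1, 73.8)},
    "Marathi": {"IndicTrans2": (25.2, 28.7, 55.4), "InLegalTrans": (44.4, 46.0, 68.9)},
    "Tamil": {"IndicTrans2": (32.8, 35.3, 62.3), "InLegalTrans": (40.0, 42.5, 69.9)},
    "Telugu": {"IndicTrans2": (10.7, 14.2, 37.9), "InLegalTrans": (31.3, 31.6, 58.5)},
    "Malayalam": {"IndicTrans2": (21.9, 25.8, 52.9), "InLegalTrans": (37.4, 40.3, 69.7)},
    "Punjabi": {"IndicTrans2": (27.8, 31.6, 51.5), "InLegalTrans": (44.3, 45.6, 65.5)},
    "Gujarati": {"IndicTrans2": (27.5, 31.1, 55.7), "InLegalTrans": (42.8, 45.2, 68.8)},
    "Odia": {"IndicTrans2": (6.6, 12.6, 37.1), "InLegalTrans": (14.2, 19.9, 47.5)},
}


def gains(scores=SCORES):
    """Return InLegalTrans minus IndicTrans2 for each language and metric."""
    result = {}
    for lang, by_model in scores.items():
        base, tuned = by_model["IndicTrans2"], by_model["InLegalTrans"]
        # Round to one decimal place to match the precision of the table.
        result[lang] = {m: round(t - b, 1) for m, b, t in zip(METRICS, base, tuned)}
    return result
```

By this calculation the BLEU gain ranges from 7.2 points (Tamil) to 20.6 points (Telugu), and every gain in the table is positive.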

Citation

If you use the InLegalTrans translation model or the MILPaC corpus, please cite the following paper:

@article{mahapatra2024milpacnovelbenchmarkevaluating,
      title = {MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages}, 
      author = {Sayan Mahapatra and Debtanu Datta and Shubham Soni and Adrijit Goswami and Saptarshi Ghosh},
      year = {2024},
      journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
      publisher = {Association for Computing Machinery},
}