InLegalTrans-En2Indic-1B: an open-source legal translation model that translates English-language Indian legal text into nine Indian languages, free of charge

Developed by law-ai

InLegalTrans is a legal translation model fine-tuned from IndicTrans2. It specializes in translating legal documents from English into nine Indian languages.

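Task: Machine translation

Format: Safetensors

Multilingual · Open source · License: MIT · Tags: #LegalDocumentTranslation #MultilingualSupport #OptimizedForIndianLanguages

Downloads: 81

Released: 1/19/2025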

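Model overview

The model specializes in translating legal documents from English into Indian languages and supports nine of them, including Bengali, Hindi, and Marathi. After fine-tuning on the MILPaC dataset, it scores substantially higher than the IndicTrans2 base model on BLEU, GLEU, and chrF++.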

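Model features

Legal-domain specialization

The model is optimized for legal documents and outperforms general-purpose translation models on Indian legal translation tasks.

Multilingual support

It translates from English into nine Indian languages, covering the major languages of India.

High performance

It substantially outperforms the IndicTrans2 base model on BLEU, GLEU, and chrF++.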

Model capabilities

English-to-Indic translation

Legal document translation

Multilingual machine translation

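Use cases

Legal document translation

Translating statutes

Translate English statutory provisions into regional Indian languages.

On the MILPaC test set, translation quality is markedly higher than that of the general-purpose IndicTrans2 base model.

Translating court documents

Translate judgments and other legal documents.

Intended to preserve the specialized terminology and accuracy of legal documents.

Legal information services

Providing legal information in multiple languages

Offer legal information services to speakers of different languages.

Improves the accessibility of legal information.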

🚀 InLegalTrans

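This is InLegalTrans-En2Indic-1B, a translation model fine-tuned from IndicTrans2. It specializes in translating Indian legal text from English into Indian languages.

🚀 Quick start

The model is optimized for translating legal text from English into Indian languages. The sections below cover its training data, architecture, usage, and fine-tuning results.

✨ Key features

Translates legal text from English into nine Indian languages, including Bengali, Hindi, and Marathi.
Fine-tuned from the IndicTrans2 model to provide high-quality translation specialized for legal text.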
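
The target language is chosen through a language code such as the ben_Beng value in the usage example in this document. A small lookup table can make that choice less error-prone. The sketch below is only an illustration: LANGUAGE_CODES and target_code are hypothetical names, and the codes are assumed to be the FLORES-200 style codes that IndicTrans2 uses for the nine languages in the results table. Only ben_Beng is confirmed by this document, so the others should be checked against the IndicTrans2 documentation.

```python
# Hypothetical lookup table: language name -> FLORES-200 style code.
# Assumption: these are the codes IndicTrans2 uses for the nine target
# languages; verify them before relying on this mapping.
LANGUAGE_CODES = {
    "bengali": "ben_Beng",
    "hindi": "hin_Deva",
    "marathi": "mar_Deva",
    "tamil": "tam_Taml",
    "telugu": "tel_Telu",
    "malayalam": "mal_Mlym",
    "punjabi": "pan_Guru",
    "gujarati": "guj_Gujr",
    "odia": "ory_Orya",
}


def target_code(language: str) -> str:
    """Return the target-language code for a language name (case-insensitive)."""
    key = language.strip().lower()
    if key not in LANGUAGE_CODES:
        supported = ", ".join(sorted(LANGUAGE_CODES))
        raise ValueError(
            f"Unsupported language {language!r}; expected one of: {supported}"
        )
    return LANGUAGE_CODES[key]


# English is always the source language for this model.
src_lang, tgt_lang = "eng_Latn", target_code("Bengali")
print(src_lang, tgt_lang)
```

Failing early on an unknown name avoids passing an unsupported code to the processor.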

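📦 Installation

The model depends on the IndicTransToolkit library, which must be installed from its GitHub repository before use. The usage example also imports torch and transformers. Once the toolkit is installed, IndicProcessor is imported as follows.

# Import IndicProcessor (requires IndicTransToolkit to be installed)
from IndicTransToolkit import IndicProcessor # Install IndicTransToolkit from https://github.com/VarunGumma/IndicTransToolkit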
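
Before loading the model, it can help to confirm that the required packages are importable. The following is a minimal sketch using only the Python standard library. The helper name missing_packages and the REQUIRED list are hypothetical, and the list simply mirrors the top-level imports in the usage example.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level modules imported by the usage example in this document.
REQUIRED = ["torch", "transformers", "IndicTransToolkit"]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```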

💻 Usage example

Basic usage

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor # Install IndicTransToolkit from https://github.com/VarunGumma/IndicTransToolkit

# Run on the first GPU when one is available, otherwise on the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# FLORES-200 style codes: ISO 639-3 language code plus ISO 15924 script code
src_lang, tgt_lang = "eng_Latn", "ben_Beng"
# Load the IndicTrans2 tokenizer so that its custom tokenization script is run
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-1B", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "law-ai/InLegalTrans-En2Indic-1B",
    trust_remote_code=True,
    attn_implementation="eager",
    low_cpu_mem_usage=True,
).to(device)
ip = IndicProcessor(inference=True)

input_sentences = [
    "(7) Any such allowance for the maintenance and expenses for proceeding shall be payable from the date of the order, or, if so ordered, from the date of the application for maintenance or expenses of proceeding, as the case may be.",
    "(2) Where it appears to the Tribunal that, in consequence of any decision of a competent Civil Court, any order made under section 9 should be cancelled or varied, it shall cancel the order or, as the case may be, vary the same accordingly.",
]

# Preprocess the sentences for the chosen source and target languages
batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)

input_text_encoding = tokenizer(
    batch,
    max_length=256,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(device)

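# Gradients are not needed for inference, so disable tracking to save memory.
# Note: do_sample=True combined with num_beams=4 selects beam-search multinomial
# sampling, so translations can differ between runs; set do_sample=False for
# deterministic beam search.
with torch.no_grad():
    generated_tokens = model.generate(
        **input_text_encoding,
        max_length=256,
        do_sample=True,
        num_beams=4,
        num_return_sequences=1,
        early_stopping=False,
        use_cache=True,
    )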

# Switch the tokenizer to target mode to decode the generated token IDs
with tokenizer.as_target_tokenizer():
    generated_tokens = tokenizer.batch_decode(
        generated_tokens.detach().cpu().tolist(),
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )

# Post-process the decoded strings into final text in the target language
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

# Print each source sentence next to its translation
for input_sentence, translation in zip(input_sentences, translations):
    print(f"Sentence in {src_lang} language: {input_sentence}")
    print(f"Translated Sentence in {tgt_lang} language: {translation}")
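
The example above translates two sentences in a single batch. For longer documents, one option is to split the sentences into fixed-size batches so that memory use stays bounded. The sketch below shows only the batching logic: batched, translate_all, and fake_translate_batch are hypothetical names, and fake_translate_batch is a stand-in that upper-cases text instead of running the preprocess, generate, decode, and postprocess steps.

```python
from typing import Callable, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_all(
    sentences: List[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Translate sentences batch by batch, preserving input order."""
    translations: List[str] = []
    for chunk in batched(sentences, batch_size):
        translations.extend(translate_batch(chunk))
    return translations


def fake_translate_batch(chunk: List[str]) -> List[str]:
    # Stand-in for the real pipeline: upper-cases instead of translating.
    return [sentence.upper() for sentence in chunk]


result = translate_all(["a", "b", "c"], fake_translate_batch, batch_size=2)
print(result)
```

In practice, translate_batch would wrap the steps from the usage example, and the batch size would be tuned to the available GPU memory.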

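📚 Documentation

Training data

The model is trained on MILPaC (Multilingual Indian Legal Parallel Corpus), the first high-quality Indian legal parallel corpus. It contains parallel text units in English (EN) and nine Indian languages (IN). For details, see the paper listed in the Citation section.

For fine-tuning, MILPaC was split at random, per language, in an 80 (training) / 10 (validation) / 10 (test) ratio. The 80% training portion was used to fine-tune IndicTrans2, and the 10% validation portion was used to select the best checkpoint and guard against overfitting.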
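
As a rough illustration of such a split, the following sketch shuffles the parallel units of one language with a fixed seed and cuts them 80/10/10. It is not the authors' procedure: the function name split_80_10_10, the seed, the integer rounding, and the toy data are all assumptions.

```python
import random


def split_80_10_10(units, seed=0):
    """Shuffle units and split them into train/validation/test (80/10/10)."""
    shuffled = list(units)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


# Toy parallel units for a single language pair (hypothetical data).
pairs = [(f"English unit {i}", f"Target unit {i}") for i in range(100)]
train, valid, test = split_80_10_10(pairs, seed=42)
print(len(train), len(valid), len(test))
```

Running the split separately for each language keeps the ratio the same in every language pair, which matches the per-language split described above.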

Model overview

InLegalTrans uses the same tokenizer and the same architecture as IndicTrans2, with about 1.12 billion parameters.

Fine-tuning results

The table below compares InLegalTrans with IndicTrans2 on the 10% test portion of MILPaC, using BLEU, GLEU, and chrF++. For every English-to-Indic pair, InLegalTrans improves substantially on IndicTrans2 and scores higher on all three metrics.

Language pair	Model	BLEU	GLEU	chrF++
English to Bengali	IndicTrans2	25.4	28.8	53.7
	InLegalTrans	45.8	47.6	70.9
English to Hindi	IndicTrans2	41.0	42.5	59.9
	InLegalTrans	56.9	57.1	73.8
English to Marathi	IndicTrans2	25.2	28.7	55.4
	InLegalTrans	44.4	46.0	68.9
English to Tamil	IndicTrans2	32.8	35.3	62.3
	InLegalTrans	40.0	42.5	69.9
English to Telugu	IndicTrans2	10.7	14.2	37.9
	InLegalTrans	31.3	31.6	58.5
English to Malayalam	IndicTrans2	21.9	25.8	52.9
	InLegalTrans	37.4	40.3	69.7
English to Punjabi	IndicTrans2	27.8	31.6	51.5
	InLegalTrans	44.3	45.6	65.5
English to Gujarati	IndicTrans2	27.5	31.1	55.7
	InLegalTrans	42.8	45.2	68.8
English to Odia	IndicTrans2	6.6	12.6	37.1
	InLegalTrans	14.2	19.9	47.5
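
To make the size of the improvement easier to read, the BLEU gain per language can be computed directly from the table. The sketch below copies the BLEU columns from the table above; the names BLEU and bleu_gains are hypothetical.

```python
# (IndicTrans2, InLegalTrans) BLEU scores copied from the results table.
BLEU = {
    "Bengali": (25.4, 45.8),
    "Hindi": (41.0, 56.9),
    "Marathi": (25.2, 44.4),
    "Tamil": (32.8, 40.0),
    "Telugu": (10.7, 31.3),
    "Malayalam": (21.9, 37.4),
    "Punjabi": (27.8, 44.3),
    "Gujarati": (27.5, 42.8),
    "Odia": (6.6, 14.2),
}


def bleu_gains(scores):
    """Return the absolute BLEU gain per language, rounded to one decimal."""
    return {lang: round(tuned - base, 1) for lang, (base, tuned) in scores.items()}


gains = bleu_gains(BLEU)
# List languages from the largest gain to the smallest.
for lang, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang:<10} +{gain:.1f} BLEU")
```

On these figures the gain is largest for Telugu (20.6 BLEU points) and smallest for Tamil (7.2 points).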

Citation

If you use the InLegalTrans translation model or the MILPaC corpus, please cite the following paper.

@article{mahapatra2024milpacnovelbenchmarkevaluating,
      title = {MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages}, 
      author = {Sayan Mahapatra and Debtanu Datta and Shubham Soni and Adrijit Goswami and Saptarshi Ghosh},
      year = {2024},
      journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
      publisher = {Association for Computing Machinery},
}

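About us

We are a group of natural language processing (NLP) researchers at the Indian Institute of Technology (IIT) Kharagpur. Our main research area is the application of machine learning (ML), deep learning (DL), and NLP to the legal domain, with a particular focus on the challenges and opportunities of the Indian legal setting. Our current and past projects include the following.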