mlotsawa-ground-small open-source model: free to deploy for translating Uchen-script Tibetan into English

Mlotsawa Ground Small

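Developed by billingsmoore

A translation model for Tibetan Buddhist texts, fine-tuned from T5-small and specialized in translating Uchen-script Tibetan into English

Machine Translation

Transformers

Multilingual · Open source · License: MIT · #TibetanBuddhistTranslation #T5FineTuning #UchenScript

Downloads: 33

Release date: 4/23/2025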

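Model Overview

This is a 60-million-parameter machine translation model focused on translating Tibetan Buddhist texts from Tibetan into English. It is part of the MLotsawa project.

Model Features

Specialized for Buddhist texts

Optimized for Tibetan Buddhist literature, including Buddhist terminology and modes of expression

Extensible base

Serves as a base model that can be fine-tuned for a specific tradition or a larger corpus

Custom tokenizer

Uses the getok tokenizer, which is designed for Tibetan Buddhist text

Model Capabilities

Tibetan-to-English translation

Buddhist text translation

Text conversion

Use Cases

Religious text translation

Buddhist scripture translation

Translates Tibetan Buddhist scriptures into English

Conveys the basic meaning of the source text, but human review is required

Prayer translation

Translates Tibetan Buddhist prayers and aspirations

Handles poetic text and preserves its basic rhythm

Academic research

Text preprocessing

Provides preliminary reference translations for academic research

Can be used as a research aid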

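🚀 mlotsawa-ground-small Model Card

This is a transformers machine translation model for translating Tibetan Buddhist texts into English. It was created as part of the larger MLotsawa project.

✨ Key Features

This model can translate Tibetan Buddhist texts into English. It can be used directly for translation or fine-tuned further to improve performance.

📦 Installation

The model needs no separate installation step. With the transformers library installed, the code below downloads the model and loads it into a translation pipeline.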

from transformers import pipeline

pipe = pipeline('translation', 'billingsmoore/mlotsawa-ground-small', device='cpu') # select a device of your choice (e.g. 'cuda:0')

💻 Usage Examples

Basic Usage

The code below uses the model directly for translation.

from transformers import pipeline

pipe = pipeline('translation', 'billingsmoore/mlotsawa-ground-small', device='cpu') # select a device of your choice (e.g. 'cuda:0')

# Named `inputs` to avoid shadowing Python's built-in input()
inputs = ["ཁྱེད་ལ་བསྟོད་ཅིང་གསོལ་བ་བཏབ་པའི་མཐུས༔",
          "བདག་གི་ཚེ་བསོད་དཔལ་འབྱོར་རྒྱས་པ་དང་༔",
          "འཇིགས་པ་བཅུ་དྲུག་རྐྱེན་ངན་བར་ཆད་སོལ༔"]

output = pipe(inputs)

translation = [elt['translation_text'] for elt in output]

print(translation)

Running the code above produces output like the following.

['Through the power of praising and praying to you', 'Increase my lifespan merit and prosperity', 'Remove the sixteen fears and obstacles of adversity.']
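The example above passes one verse line per list element. When the source is a whole stanza in a single string, it may help to split it into lines first. The following is a minimal sketch, not part of the original model card: the helper name `split_verse_lines` is hypothetical, and it assumes that lines end with a shad (།) or a terma mark (༔). That holds for verse like the example, but prose often uses the shad mid-sentence.

```python
import re

# A line end is one or more marks, optionally separated by spaces (e.g. "། །").
# U+0F0D is the shad, U+0F14 is the terma mark (gter tsheg).
_LINE_END = re.compile(r"[\u0F0D\u0F14](?:\s*[\u0F0D\u0F14])*")

def split_verse_lines(text):
    """Split a stanza into lines, keeping each line's closing mark."""
    lines, start = [], 0
    for match in _LINE_END.finditer(text):
        line = text[start:match.end()].strip()
        if line:
            lines.append(line)
        start = match.end()
    tail = text[start:].strip()  # text after the last mark, if any
    if tail:
        lines.append(tail)
    return lines

stanza = "ཁྱེད་ལ་བསྟོད་ཅིང་གསོལ་བ་བཏབ་པའི་མཐུས༔ བདག་གི་ཚེ་བསོད་དཔལ་འབྱོར་རྒྱས་པ་དང་༔"
lines = split_verse_lines(stanza)
# `lines` could then be passed to the pipeline from the example: pipe(lines)
```

Whether line-by-line or stanza-level input gives better translations is not stated in the model card, so it is worth trying both on your own texts.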

Advanced Usage

To improve the model's performance, you can fine-tune it further. The code below is an example of fine-tuning the model.

# Load Your Data
from datasets import load_dataset

dataset = load_dataset("<your dataset>")  # replace with your dataset name or path

# Load the Model and Tokenizer
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("billingsmoore/mlotsawa-ground-small", device_map="cuda:0") # this line assumes you want to use a single CUDA enabled gpu
tokenizer = AutoTokenizer.from_pretrained('billingsmoore/mlotsawa-ground-small')
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Preprocess the Data
def translation_preprocess_function(examples):

    # Prepare translation inputs and targets
    translation_inputs = ['Translate Tibetan to English: ' + example for example in examples['bo']]
    translation_targets = [example for example in examples['en']]
    
    # Tokenize translation inputs and targets
    translation_model_inputs = tokenizer(translation_inputs, text_target=translation_targets, 
                                         max_length=256, truncation=True, padding="max_length")
    
    
    return translation_model_inputs

tokenized_dataset = dataset.map(translation_preprocess_function, batched=True)

# Define Evaluation Metrics
import numpy as np
import evaluate

# Load BLEU, chrF, and TER metrics
bleu_metric = evaluate.load("sacrebleu")
chrf_metric = evaluate.load("chrf")
ter_metric = evaluate.load("ter")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    
    # Decode predictions and labels
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Postprocess text
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # Compute BLEU score
    bleu_result = bleu_metric.compute(predictions=decoded_preds, references=decoded_labels)
    bleu_score = bleu_result["score"]

    # Compute CHRF score
    chrf_result = chrf_metric.compute(predictions=decoded_preds, references=decoded_labels)
    chrf_score = chrf_result["score"]

    # Compute TER score
    ter_result = ter_metric.compute(predictions=decoded_preds, references=decoded_labels)
    ter_score = ter_result["score"]

    # Return rounded results
    metrics = {
        "bleu": round(bleu_score, 4),
        "chrf": round(chrf_score, 4),
        "ter": round(ter_score, 4)
    }

    #print("Computed Metrics:", metrics)

    return metrics

# Set Up Training Arguments and Optimizer
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, Adafactor, EarlyStoppingCallback
from accelerate import Accelerator

accelerator = Accelerator()

optimizer = Adafactor(
    model.parameters(), 
    scale_parameter=True, 
    relative_step=False, 
    warmup_init=False, 
    lr=3e-4
)

model, optimizer = accelerator.prepare(model, optimizer)

training_args = Seq2SeqTrainingArguments(
    output_dir="output-dir", # select an output directory of your choice
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    num_train_epochs=100, # select your preferred number of training epochs
    load_best_model_at_end=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['dev'],
    processing_class=tokenizer,
    optimizers=(optimizer, None),
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback()]
)

trainer.train()
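The fine-tuning script above reads the columns `bo` and `en` and the splits `train` and `dev`. As a rough sketch of a matching data layout, the snippet below writes two JSON Lines files with those column names. The file names and the toy rows are assumptions for illustration only. The rows reuse the example sentences and model outputs from Basic Usage, which are not reference translations.

```python
import json
import tempfile
from pathlib import Path

# Toy rows for layout illustration only (not real training data).
pairs = [
    {"bo": "ཁྱེད་ལ་བསྟོད་ཅིང་གསོལ་བ་བཏབ་པའི་མཐུས༔",
     "en": "Through the power of praising and praying to you"},
    {"bo": "བདག་གི་ཚེ་བསོད་དཔལ་འབྱོར་རྒྱས་པ་དང་༔",
     "en": "Increase my lifespan merit and prosperity"},
]

def write_jsonl(path, rows):
    """Write one JSON object per line, keeping Tibetan text unescaped."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

data_dir = Path(tempfile.mkdtemp())
write_jsonl(data_dir / "train.jsonl", pairs[:1])
write_jsonl(data_dir / "dev.jsonl", pairs[1:])

# With the `datasets` library installed, files like these could be loaded as:
#   dataset = load_dataset("json", data_files={
#       "train": str(data_dir / "train.jsonl"),
#       "dev": str(data_dir / "dev.jsonl"),
#   })
```

If your dataset names its validation split differently (for example `validation`), adjust `eval_dataset` in the script accordingly.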

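📚 Documentation

Model Details

Model Description

This model is a fine-tuned T5 model (small size) with 60 million parameters. It is intended for translating Tibetan Buddhist texts into English. Input is expected to be in Uchen script. The model uses the getok tokenizer. Details of the training data and procedure are given below.

This is a base model. It performs reasonably well on its own, but it is intended as a starting point for further fine-tuning on a larger corpus or a tradition-specific corpus (for example, Dzogchen) to improve translation quality.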

Attribute	Details
Developer	billingsmoore
Model type	Translation
Languages	Tibetan, English
License	MIT
Fine-tuned from	google-t5/t5-small
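Because the model expects Tibetan-script input, it can be useful to screen out strings that are mostly something else (for example, Wylie transliteration) before translating. The sketch below is an assumption-laden helper, not part of the model card: the function names and the 0.9 threshold are hypothetical. Note that Unicode does not distinguish Uchen from other Tibetan script styles, so this only checks for characters in the Tibetan block (U+0F00 to U+0FFF).

```python
def tibetan_ratio(text):
    """Fraction of non-whitespace characters in the Unicode Tibetan block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0F00" <= c <= "\u0FFF" for c in chars) / len(chars)

def looks_tibetan(text, threshold=0.9):
    """Heuristic check; the threshold is an illustrative assumption."""
    return tibetan_ratio(text) >= threshold

candidates = ["ཁྱེད་ལ་བསྟོད་ཅིང་", "khyed la bstod cing"]
usable = [text for text in candidates if looks_tibetan(text)]
```

Here only the first candidate passes the check; the transliterated string would need converting to Tibetan script before being sent to the model.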

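Model Sources

Repository: MLotsawa on GitHub

Uses

This model can be used directly for translation or fine-tuned further to improve performance.

Bias, Risks, and Limitations

This model is intended for translating Buddhist texts. Given the complexity and importance of this material, every translation should be treated as preliminary and should not be used without review by an experienced human translator. In addition, the model was trained only on Tibetan Buddhist material and should not be expected to perform well on other material (for example, everyday conversational Tibetan).

Training Details

Training Data

The training data consists of 861,417 translation pairs from Buddhist texts. The data was collected from publicly available material and from material provided by Monlam AI and the Tibetan and Himalayan Library.

Training Procedure

The model underwent continued pretraining followed by fine-tuning, as described below.

Pretraining

The model was pretrained on the training data for one epoch with a learning rate of 3e-4. The pretraining objective remained the original span-corruption denoising task: random spans of input tokens are masked, and the model is trained to reconstruct the missing content. This pretraining allowed the model to adapt to the new tokenizer and to learn the linguistic and structural characteristics of Tibetan Buddhist material.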
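To make the span-corruption objective concrete, here is a small sketch of how one training example could be built. It is not the project's actual pretraining code. The `<extra_id_N>` sentinel names follow the original T5 convention and are an assumption for this model's tokenizer, whitespace-separated words stand in for subword tokens, and the `noise_density` and `span_len` values are illustrative rather than taken from the model card.

```python
import random

def sample_spans(n_tokens, noise_density=0.15, span_len=3, seed=0):
    """Pick non-overlapping, non-adjacent (start, length) spans to mask."""
    rng = random.Random(seed)
    slot = span_len + 1  # each span is followed by at least one kept token
    n_slots = n_tokens // slot
    n_spans = min(n_slots, max(1, round(n_tokens * noise_density / span_len)))
    starts = sorted(rng.sample(range(n_slots), n_spans))
    return [(s * slot, span_len) for s in starts]

def corrupt_spans(tokens, spans):
    """Replace each span with a sentinel; targets list the masked content."""
    inputs, targets, cursor = [], [], 0
    for i, (start, length) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # kept tokens before the span
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])  # the masked tokens
        cursor = start + length
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets

tokens = "Through the power of praising and praying to you".split()
inputs, targets = corrupt_spans(tokens, [(2, 2)])
# inputs:  Through the <extra_id_0> praising and praying to you
# targets: <extra_id_0> power of <extra_id_1>
```

The model sees `inputs` and is trained to produce `targets`, so it learns to fill in missing content without needing parallel translations.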

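Fine-tuning

The model was fine-tuned on the translation pairs for 50 epochs using the Adafactor optimizer with an initial learning rate of 3e-4.

Evaluation

The model was evaluated on test data using BLEU, chrF, and TER. The results are as follows.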

BLEU	chrF	TER
3.54	19.89	87.58

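These scores are poor (BLEU and chrF are low, and TER, an error rate, is high), but the actual translations are comparatively good. Sample translations are shown below.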
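One reason the metrics can look worse than the translations read is that they reward surface overlap with a single reference. The sketch below illustrates this with a simplified, TER-like word edit rate. It is not the `ter` metric used in the evaluation, which also allows block shifts and applies its own tokenization. The function names are hypothetical, and the two strings are adapted from the first sample translation in this section with trailing punctuation trimmed.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word lists (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def simple_edit_rate(hypothesis, reference):
    """Edits needed per reference word, as a percentage (lower is better)."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * word_edit_distance(hyp, ref) / len(ref)

human = "Lord of conquerors, Longchen Rabjam"
machine = "Lord of the victorious ones, Longchen Rabjam"
rate = simple_edit_rate(machine, human)  # 60.0: three edits over five words
```

The two lines mean the same thing, yet the edit rate is 60 because "conquerors" and "the victorious ones" share no words. Paraphrases like this are common in the samples, which is consistent with low overlap scores alongside readable output.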

From "Advice on Bending Mind Toward the Good"

By Khenchen Ngawang Palzang, translated by Joseph McClellan with editorial assistance from Ninjyed N.T., 2024

Original	Human translation	Machine translation
གྲུབ་བརྒྱའི་སྤྱི་མེས་པཎ་ཆེན་བི་མ་ལ། ། བསམ་བཞིན་སྤྲུལ་པའི་ཟློས་གར་ཉེར་བཟུང་བ། ། རྒྱལ་བའི་དབང་པོ་ཀློང་ཆེན་རབ་འབྱམས་པ། ། འདི་ཙམ་མ་ཡིན་ཚེ་རབས་གཏན་གྱི་སྐྱབས། །	Grandsire of a hundred siddhas—great scholar, Vimalamitra, And you who fully embraced the spectacle of intentional emanation, Lord of conquerors, Longchen Rabjam— You are my unfailing refuge; not just now, but in the concatenation of my lives.	Great paṇḍita Vimalamitra, forefather of hundreds of siddhas, Manifesting in the form of a play, Lord of the victorious ones, Longchen Rabjam, Not just this but the constant refuge throughout all my lives,

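From "Protection from All Fears: A Prayer to Ārya Tārā from the Reality Ḍākinīs' Secret Treasury"

By Sera Khandro, translated by Adam Pearcey, 2025

Original	Human translation	Machine translation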
ཀ་དག་སྤྲོས་བྲལ་འོད་གསལ་རིག་པའི་དབྱིངས༔ ལྷུན་གྲུབ་སྣང་ཆ་མ་འགགས་སྒྱུ་འཕྲུལ་གར༔ ཐུགས་རྗེ་རྒྱལ་བ་ཀུན་གྱི་ཡུམ་གཅིག་མ༔ རྗེ་བཙུན་ཨཱརྗེ་ཏཱ་རེ་ཚེ་སྦྱིན་དཔལ༔ གསོལ་བ་འདེབས་སོ་རླུང་སེམས་དབང་བསྡུས་ནས༔ ཚེ་དང་བསོད་ནམས་འཕེལ་བར་མཛད་དུ་གསོལ༔	Out of the primordially pure unelaborate space of luminous awareness, As the magical manifestation of unobstructed spontaneous presence, Arises the compassionate one, the one and only mother of all victorious ones, Noble Lady Ārya Tārā, glorious bestower of longevity, To you I pray! Take control of my vital winds and mind, And increase my lifespan and merit!	Within the space of awareness—primordial purity free of elaboration— Illusory dance of spontaneously present appearances unceasing Only mother of all the buddhas of compassion Noble Ārya Tārā glorious Tārā To you I pray: bringing the vāyu-mind under control And increase our lifespan and merit.