prototype-tibetan-to-english-translation-v1: Open-Source Translation Model

Prototype Tibetan To English Translation V1

Developed by billingsmoore

A neural machine translation model for translating Tibetan literary texts into English.

Machine Translation

Transformers

Multilingual | Open source | License: CC | #TibetanLiteratureTranslation #FineTunedT5 #CrossCulturalExchange

Downloads: 170

Released: 9/30/2024

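Model Overview

The model translates Tibetan text, written either in Tibetan script or in THL Simplified Phonetic Transliteration, into English, supporting the circulation and exchange of Tibetan literature.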

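Model Features

Multiple input formats

Accepts both Tibetan script and THL Simplified Phonetic Transliteration as input.

High-accuracy translation

Fine-tuned from the T5-large model for Tibetan-to-English translation.

Flexible deployment

Can be integrated into the MLotsawa software or used on its own in a Python environment.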

Model Capabilities

Tibetan-to-English text translation

Transliterated text input

Batch translation

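Use Cases

Literature translation

Translating Tibetan literature

Translates classical Tibetan texts into English.

Helps Tibetan literature reach an international audience.

Cultural exchange

Supporting cultural exchange

Helps readers who do not know Tibetan understand Tibetan content.

Promotes cross-cultural exchange.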

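🚀 Tibetan-to-English Translation Model

This is a neural machine translation model for translating literary Tibetan into English. It takes as input Tibetan text written in Tibetan script or transliterated according to THL Simplified Phonetic Transliteration, and outputs an English translation. This work is licensed under Creative Commons Attribution-NonCommercial 4.0 International.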
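Because the model accepts two input forms, it can be handy to know which form a given string is in, for example when logging or routing inputs. The sketch below is only an illustration: the function names and the two labels are hypothetical, and nothing in the model card says the model needs this check. It relies on the fact that Tibetan script occupies the Unicode block U+0F00 to U+0FFF.

```python
def contains_tibetan_script(text: str) -> bool:
    """Return True if any character lies in the Unicode Tibetan block (U+0F00-U+0FFF)."""
    return any("\u0f00" <= ch <= "\u0fff" for ch in text)


def describe_input(text: str) -> str:
    """Label the input form. The label strings are hypothetical, chosen for this sketch."""
    return "tibetan-script" if contains_tibetan_script(text) else "transliteration"


# A common greeting, once in Tibetan script and once in phonetic form
print(describe_input("བཀྲ་ཤིས་བདེ་ལེགས།"))
print(describe_input("tashi delek"))
```

Any string without Tibetan-block characters is labeled as transliteration here, so the check cannot tell phonetic Tibetan apart from other Latin-script text.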

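🚀 Quick Start

This model is intended to serve as the translation model within the larger MLotsawa software, but it can also be used in a Jupyter notebook or a Python script.

✨ Key Features

Translates literary Tibetan into English.
Accepts Tibetan script or transliterated Tibetan text as input.
Can be used within the larger MLotsawa software, in a Jupyter notebook, or in a Python script.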

💻 Usage Examples

Basic Usage

To use this model for translation, you can use the following code.

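from transformers import pipeline

# Load the translation pipeline with the fine-tuned checkpoint
translator = pipeline('translation', 'billingsmoore/tibetan-to-english-translation')

# Replace the placeholder with your own Tibetan text
input_text = "<your transliterated Tibetan text>"

translation = translator(input_text)

print(translation)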
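For longer texts it is often convenient to translate many lines in one go. The following is a minimal sketch, assuming the translator behaves like a transformers translation pipeline: called with a list of strings, it returns one dict per input with the result under the `translation_text` key. The helper names `chunked` and `translate_lines` and the default chunk size are hypothetical, and a stand-in translator is used so the sketch runs without downloading the model.

```python
from typing import Callable, Iterable, List


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_lines(translator: Callable, lines: List[str], chunk_size: int = 8) -> List[str]:
    """Translate `lines` chunk by chunk and return the translations as plain strings."""
    results: List[str] = []
    for chunk in chunked(lines, chunk_size):
        outputs = translator(chunk)  # one dict per input line
        results.extend(out["translation_text"] for out in outputs)
    return results


# Stand-in for the real pipeline (an assumption made so the example is self-contained)
def fake_translator(batch: List[str]) -> List[dict]:
    return [{"translation_text": f"<translation of {text}>"} for text in batch]


print(translate_lines(fake_translator, ["line one", "line two", "line three"], chunk_size=2))
```

To use the real model, pass the `translator` object created with `pipeline(...)` in place of `fake_translator`.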

Advanced Usage

To fine-tune this model further, you can use the following code.

from datasets import load_dataset
from transformers import (
  AutoTokenizer, DataCollatorForSeq2Seq,
  AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments,
  Seq2SeqTrainer, EarlyStoppingCallback, Adafactor
)
import evaluate
import numpy as np
from accelerate import Accelerator

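# Replace the placeholder with the path or Hub ID of your dataset
dataset = load_dataset("<path_to_your_dataset>")

checkpoint = "billingsmoore/tibetan-to-english-translation"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# The collator's optional `model` argument expects a model object, not a checkpoint name, so it is left unset
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer)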

source_lang = 'bo'
target_lang = 'en'
prefix = "translate Tibetan to English: "

def preprocess_function(examples):
    # Prepend the task prefix to each Tibetan source sentence
    inputs = [prefix + example[source_lang] for example in examples['translation']]
    targets = [example[target_lang] for example in examples['translation']]

    # Tokenize sources and targets together, truncating to 128 tokens
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)

    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

metric = evaluate.load("sacrebleu")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

early_stop = EarlyStoppingCallback()

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")

optimizer = Adafactor(
    model.parameters(), 
    scale_parameter=True, 
    relative_step=False, 
    warmup_init=False, 
    lr=3e-4
)

training_args = Seq2SeqTrainingArguments(
    output_dir=".",
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False, #check this
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    optimizers=(optimizer, None),
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[early_stop]
)

trainer.train()

📚 Documentation

Model Details

Model Description

This model is a fine-tuned T5 model with 770 million parameters.

Attribute	Details
Developed by	billingsmoore
Language(s) (NLP)	Tibetan, English
License	Attribution-NonCommercial 4.0 International
Fine-tuned from model	'google-t5/t5-large'

Model Sources

Repository: MLotsawa on GitHub

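Training Details

Training Data

The training data for this project is publicly available. The dataset consists of 100,000 pairs of sentences or phrases. The first member of each pair is a sentence or phrase in Classical Tibetan, and the second is its English translation. The pairs were extracted from texts sourced from Lotsawa House (lotsawahouse.org) and are offered under the same license as the original texts. The data was scraped, cleaned, and formatted programmatically.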
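The fine-tuning script earlier on this page reads each record as a `translation` dict with `bo` and `en` keys and expects `train` and `test` splits. Whether the published dataset already uses that layout is not stated here, so the sketch below only illustrates how (Tibetan, English) pairs could be brought into that shape. The function names, the 10% test fraction, and the seed are assumptions, and the sample pairs are placeholders rather than real data.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, Dict[str, str]]


def pairs_to_records(pairs: List[Tuple[str, str]]) -> List[Record]:
    """Wrap each (Tibetan, English) pair in the nested layout the fine-tuning script reads."""
    return [{"translation": {"bo": bo, "en": en}} for bo, en in pairs]


def split_records(records: List[Record], test_fraction: float = 0.1, seed: int = 42) -> Dict[str, List[Record]]:
    """Shuffle deterministically and hold out a test portion (at least one record)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}


# Placeholder pairs; substitute real sentence pairs
pairs = [(f"<Tibetan sentence {i}>", f"<English translation {i}>") for i in range(20)]
splits = split_records(pairs_to_records(pairs))
print(len(splits["train"]), len(splits["test"]))
```

Plain lists are used so the sketch has no dependencies; records in this shape can then be loaded into whatever dataset container the training code uses.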

Training Procedure

The T5 tokenizer was updated in the same manner as for ['billingsmoore/tibetan-phonetic-transliteration'](https://huggingface.co/billingsmoore/tibetan-phonetic-transliteration); the procedure is described in that model's card. Building on the training of ['billingsmoore/phonetic-tibetan-to-english-translation'](https://huggingface.co/billingsmoore/phonetic-tibetan-to-english-translation), which is described in full in its model card, this model was trained for 9 epochs on the ['billingsmoore/tibetan-to-english-translation-dataset'](https://huggingface.co/datasets/billingsmoore/tibetan-to-english-translation-dataset) dataset.

Training Hyperparameters

This model was trained with the Adafactor optimizer and a learning rate of 2e-5.

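Evaluation

The evaluation metric for this model is the BLEU score as implemented by sacreBLEU. BLEU (Bilingual Evaluation Understudy) measures the quality of machine translation by comparing it with human reference translations. Scores range from 0 to 100, where 100 is a perfect match with the reference. BLEU evaluates the precision of the n-grams (word sequences) in the generated text, so a higher score indicates closer agreement with the reference translation. A brevity penalty is applied to discourage translations that are too short.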
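To make the roles of n-gram precision and the brevity penalty concrete, here is a simplified single-sentence BLEU in plain Python. It is a sketch for illustration only, not sacreBLEU: it splits on whitespace, uses one reference, and applies no smoothing, so its numbers will generally differ from sacreBLEU's. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of length `n` in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified BLEU on a 0-100 scale: clipped n-gram precision times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Penalize hypotheses shorter than the reference
    brevity_penalty = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100.0 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat today"
print(simple_bleu(reference, reference))             # exact match
print(simple_bleu("the cat sat on the", reference))  # truncated hypothesis
```

The truncated hypothesis contains only n-grams that also occur in the reference, so every precision is 1 and its score falls below 100 solely because of the brevity penalty, exp(1 - 7/5), giving about 67.03.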