🚀 Model Card for mlotsawa-ground-small
This is a transformers machine translation model for translating Tibetan Buddhist texts into English. It is part of the larger MLotsawa project.
✨ Features
- Translation Model: Specifically tailored for translating Tibetan Buddhist texts to English.
- Finetuned T5: Based on the small-sized T5 model with 60 million parameters, finetuned for better performance.
- Ground Model: Can be used directly or as a base for further finetuning to enhance translation quality.
📦 Installation
To use the model, install the transformers library along with the other dependencies imported in the usage examples below (datasets, evaluate, sacrebleu, accelerate, and a PyTorch backend).
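A minimal setup, assuming pip, that covers the libraries imported in the examples below (versions are not pinned by this card):

```bash
pip install transformers datasets evaluate sacrebleu accelerate torch numpy
```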
💻 Usage Examples
Basic Usage
The model can be used directly with a transformers pipeline:

```python
from transformers import pipeline

# Load the translation pipeline; set device to a GPU index (e.g. device=0) if one is available.
pipe = pipeline('translation', 'billingsmoore/mlotsawa-ground-small', device='cpu')

# Input is Tibetan text in Uchen script, one line per list element.
input_texts = ["ཁྱེད་ལ་བསྟོད་ཅིང་གསོལ་བ་བཏབ་པའི་མཐུས༔",
               "བདག་གི་ཚེ་བསོད་དཔལ་འབྱོར་རྒྱས་པ་དང་༔",
               "འཇིགས་པ་བཅུ་དྲུག་རྐྱེན་ངན་བར་ཆད་སོལ༔"]

output = pipe(input_texts)

# Each result is a dict with a 'translation_text' key.
translation = [elt['translation_text'] for elt in output]
print(translation)
```
Advanced Usage
The following example further finetunes the model on your own data. The preprocessing expects a dataset with a 'bo' (Tibetan) column and an 'en' (English) column, split into 'train' and 'dev' sets.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM

# Load your parallel corpus; the preprocessing below expects a 'bo' (Tibetan)
# column and an 'en' (English) column.
dataset = load_dataset("<your dataset>")

model = AutoModelForSeq2SeqLM.from_pretrained("billingsmoore/mlotsawa-ground-small", device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained('billingsmoore/mlotsawa-ground-small')
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Prepend the task prefix used during finetuning, then tokenize inputs and targets.
def translation_preprocess_function(examples):
    translation_inputs = ['Translate Tibetan to English: ' + example for example in examples['bo']]
    translation_targets = [example for example in examples['en']]

    translation_model_inputs = tokenizer(translation_inputs, text_target=translation_targets,
                                         max_length=256, truncation=True, padding="max_length")
    return translation_model_inputs

tokenized_dataset = dataset.map(translation_preprocess_function, batched=True)

import numpy as np
import evaluate

# Evaluation metrics: BLEU, chrF, and TER.
bleu_metric = evaluate.load("sacrebleu")
chrf_metric = evaluate.load("chrf")
ter_metric = evaluate.load("ter")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    # Replace -100 (the ignored-label id) with the pad token id before decoding.
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    bleu_result = bleu_metric.compute(predictions=decoded_preds, references=decoded_labels)
    bleu_score = bleu_result["score"]

    chrf_result = chrf_metric.compute(predictions=decoded_preds, references=decoded_labels)
    chrf_score = chrf_result["score"]

    ter_result = ter_metric.compute(predictions=decoded_preds, references=decoded_labels)
    ter_score = ter_result["score"]

    metrics = {
        "bleu": round(bleu_score, 4),
        "chrf": round(chrf_score, 4),
        "ter": round(ter_score, 4)
    }

    return metrics

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, Adafactor, EarlyStoppingCallback
from accelerate import Accelerator

accelerator = Accelerator()

# Adafactor with a fixed learning rate of 3e-4 (relative_step disabled).
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
    lr=3e-4
)

model, optimizer = accelerator.prepare(model, optimizer)

training_args = Seq2SeqTrainingArguments(
    output_dir="output-dir",
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    num_train_epochs=100,
    load_best_model_at_end=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['dev'],
    processing_class=tokenizer,
    optimizers=(optimizer, None),
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback()]
)

trainer.train()
```
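After training, because `load_best_model_at_end=True`, the best checkpoint is loaded back into the trainer. A minimal follow-up sketch for keeping the finetuned weights (the directory name here is a placeholder):

```python
# Save the finetuned model and its tokenizer to a local directory (placeholder path).
trainer.save_model("mlotsawa-finetuned")
tokenizer.save_pretrained("mlotsawa-finetuned")
```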
📚 Documentation
Model Details
| Property | Details |
|---|---|
| Model Type | translation |
| Developed by | billingsmoore |
| Languages | Tibetan, English |
| License | MIT |
| Finetuned from model | google-t5/t5-small |
| Repository | MLotsawa on GitHub |
Uses
- Direct Use: Can be used directly for translation using a transformers pipeline.
- Downstream Use: Can be further finetuned for improved performance on specific datasets.
Bias, Risks, and Limitations
⚠️ Important Note
This model is for translating Buddhist texts. All translations should be considered preliminary and used with the input of an experienced human translator. It was trained on Tibetan Buddhist material and may not perform well on other types of content.
Training Details
- Training Data: 861,417 translation pairs from Buddhist texts, collected from public and private sources.
- Training Procedure:
  - Pretraining: one epoch on the training data with a learning rate of 3e-4, using the original span-corruption denoising objective (a simplified sketch follows this list).
  - Finetuning: 50 epochs on the translation pairs using the Adafactor optimizer with an initial learning rate of 3e-4.
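For illustration only, the pretraining objective can be sketched as a simplified, single-span version of T5 span corruption. This is not the actual pretraining script, and it assumes the tokenizer exposes T5-style sentinel tokens such as `<extra_id_0>`:

```python
import random

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("billingsmoore/mlotsawa-ground-small")

def corrupt_single_span(text, span_length=3):
    """Mask one contiguous span of tokens; the model learns to reconstruct it."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if len(ids) <= span_length:
        return None  # too short to corrupt

    start = random.randrange(len(ids) - span_length)
    sentinel_0 = tokenizer.convert_tokens_to_ids("<extra_id_0>")
    sentinel_1 = tokenizer.convert_tokens_to_ids("<extra_id_1>")

    # Input: original tokens with the chosen span replaced by a sentinel.
    input_ids = ids[:start] + [sentinel_0] + ids[start + span_length:]
    # Target: the sentinel, the masked span, then a closing sentinel.
    labels = [sentinel_0] + ids[start:start + span_length] + [sentinel_1]
    return {"input_ids": input_ids, "labels": labels}
```

The actual T5 objective corrupts multiple spans covering roughly 15% of the tokens in each sequence; the single-span version above is only meant to show the input/target format.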
Evaluation
| BLEU | chrF | TER |
|---|---|---|
| 3.54 | 19.89 | 87.58 |
Sample translations give a more concrete picture of the model's actual performance than these corpus-level scores alone.
🔧 Technical Details
The model is a finetuned T5 model with 60 million parameters. It uses the getok tokenizer and expects input in Uchen script.
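As a quick sanity check (a minimal sketch; the sample line is reused from the Basic Usage example above), you can inspect how the tokenizer splits an Uchen-script line:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("billingsmoore/mlotsawa-ground-small")

# Uchen-script input, as expected by the model.
print(tokenizer.tokenize("ཁྱེད་ལ་བསྟོད་ཅིང་གསོལ་བ་བཏབ་པའི་མཐུས༔"))
```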
📄 License
This model is released under the MIT license.
Model Card Authors
billingsmoore
Model Card Contact
billingsmoore[at]gmail[dot]com