🚀 Model Card for mlotsawa-ground-small
This is a transformers machine translation model for translating Tibetan Buddhist texts into English. It is part of the larger MLotsawa project.
✨ Features
- Translation Model: Specifically tailored for translating Tibetan Buddhist texts to English.
- Finetuned T5: Based on the small-sized T5 model with 60 million parameters, finetuned for better performance.
- Ground Model: Can be used directly or as a base for further finetuning to enhance translation quality.
📦 Installation
To use the model, install the transformers library along with the other dependencies imported in the usage examples below (datasets, evaluate, sacrebleu, accelerate, and a PyTorch backend).
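A minimal setup, assuming pip, that covers the libraries imported in the examples below (versions are not pinned by this card):

```bash
pip install transformers datasets evaluate sacrebleu accelerate torch numpy
```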
💻 Usage Examples
Basic Usage
The model can be used directly with a transformers pipeline:

```python
from transformers import pipeline

# Load the translation pipeline; set device to a GPU index (e.g. device=0) if one is available.
pipe = pipeline('translation', 'billingsmoore/mlotsawa-ground-small', device='cpu')

# Input is Tibetan text in Uchen script, one line per list element.
input_texts = ["ཁྱེད་ལ་བསྟོད་ཅིང་གསོལ་བ་བཏབ་པའི་མཐུས༔",
               "བདག་གི་ཚེ་བསོད་དཔལ་འབྱོར་རྒྱས་པ་དང་༔",
               "འཇིགས་པ་བཅུ་དྲུག་རྐྱེན་ངན་བར་ཆད་སོལ༔"]

output = pipe(input_texts)

# Each result is a dict with a 'translation_text' key.
translation = [elt['translation_text'] for elt in output]
print(translation)
```
Advanced Usage
The following example further finetunes the model on your own data. The preprocessing expects a dataset with a 'bo' (Tibetan) column and an 'en' (English) column, split into 'train' and 'dev' sets.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM

# Load your parallel corpus; the preprocessing below expects a 'bo' (Tibetan)
# column and an 'en' (English) column.
dataset = load_dataset("<your dataset>")

model = AutoModelForSeq2SeqLM.from_pretrained("billingsmoore/mlotsawa-ground-small", device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained('billingsmoore/mlotsawa-ground-small')
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Prepend the task prefix used during finetuning, then tokenize inputs and targets.
def translation_preprocess_function(examples):
    translation_inputs = ['Translate Tibetan to English: ' + example for example in examples['bo']]
    translation_targets = [example for example in examples['en']]

    translation_model_inputs = tokenizer(translation_inputs, text_target=translation_targets,
                                         max_length=256, truncation=True, padding="max_length")
    return translation_model_inputs

tokenized_dataset = dataset.map(translation_preprocess_function, batched=True)

import numpy as np
import evaluate

# Evaluation metrics: BLEU, chrF, and TER.
bleu_metric = evaluate.load("sacrebleu")
chrf_metric = evaluate.load("chrf")
ter_metric = evaluate.load("ter")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    # Replace -100 (the ignored-label id) with the pad token id before decoding.
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    bleu_result = bleu_metric.compute(predictions=decoded_preds, references=decoded_labels)
    bleu_score = bleu_result["score"]

    chrf_result = chrf_metric.compute(predictions=decoded_preds, references=decoded_labels)
    chrf_score = chrf_result["score"]

    ter_result = ter_metric.compute(predictions=decoded_preds, references=decoded_labels)
    ter_score = ter_result["score"]

    metrics = {
        "bleu": round(bleu_score, 4),
        "chrf": round(chrf_score, 4),
        "ter": round(ter_score, 4)
    }

    return metrics

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, Adafactor, EarlyStoppingCallback
from accelerate import Accelerator

accelerator = Accelerator()

# Adafactor with a fixed learning rate of 3e-4 (relative_step disabled).
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=False,
    warmup_init=False,
    lr=3e-4
)

model, optimizer = accelerator.prepare(model, optimizer)

training_args = Seq2SeqTrainingArguments(
    output_dir="output-dir",
    auto_find_batch_size=True,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
    eval_strategy='epoch',
    save_strategy='epoch',
    num_train_epochs=100,
    load_best_model_at_end=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['dev'],
    processing_class=tokenizer,
    optimizers=(optimizer, None),
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback()]
)

trainer.train()
```
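After training, because `load_best_model_at_end=True`, the best checkpoint is loaded back into the trainer. A minimal follow-up sketch for keeping the finetuned weights (the directory name here is a placeholder):

```python
# Save the finetuned model and its tokenizer to a local directory (placeholder path).
trainer.save_model("mlotsawa-finetuned")
tokenizer.save_pretrained("mlotsawa-finetuned")
```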
📚 Documentation
Model Details
| Property | Details |
|---|---|
| Model Type | translation |
| Developed by | billingsmoore |
| Languages | Tibetan, English |
| License | MIT |
| Finetuned from model | google-t5/t5-small |
| Repository | MLotsawa on GitHub |
Uses
- Direct Use: Can be used directly for translation using a transformers pipeline.
- Downstream Use: Can be further finetuned for improved performance on specific datasets.
Bias, Risks, and Limitations
⚠️ Important Note
This model is for translating Buddhist texts. All translations should be considered preliminary and used with the input of an experienced human translator. It was trained on Tibetan Buddhist material and may not perform well on other types of content.
Training Details
- Training Data: 861,417 translation pairs from Buddhist texts, collected from public and private sources.
- Training Procedure:
  - Pretraining: one epoch on the training data with a learning rate of 3e-4, using the original span-corruption denoising objective (a simplified sketch follows this list).
  - Finetuning: 50 epochs on the translation pairs using the Adafactor optimizer with an initial learning rate of 3e-4.
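For illustration only, the pretraining objective can be sketched as a simplified, single-span version of T5 span corruption. This is not the actual pretraining script, and it assumes the tokenizer exposes T5-style sentinel tokens such as `<extra_id_0>`:

```python
import random

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("billingsmoore/mlotsawa-ground-small")

def corrupt_single_span(text, span_length=3):
    """Mask one contiguous span of tokens; the model learns to reconstruct it."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if len(ids) <= span_length:
        return None  # too short to corrupt

    start = random.randrange(len(ids) - span_length)
    sentinel_0 = tokenizer.convert_tokens_to_ids("<extra_id_0>")
    sentinel_1 = tokenizer.convert_tokens_to_ids("<extra_id_1>")

    # Input: original tokens with the chosen span replaced by a sentinel.
    input_ids = ids[:start] + [sentinel_0] + ids[start + span_length:]
    # Target: the sentinel, the masked span, then a closing sentinel.
    labels = [sentinel_0] + ids[start:start + span_length] + [sentinel_1]
    return {"input_ids": input_ids, "labels": labels}
```

The actual T5 objective corrupts multiple spans covering roughly 15% of the tokens in each sequence; the single-span version above is only meant to show the input/target format.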
Evaluation
| BLEU | chrF | TER |
|---|---|---|
| 3.54 | 19.89 | 87.58 |
Sample translations give a more concrete picture of the model's actual performance than these corpus-level scores alone.
🔧 Technical Details
The model is a finetuned T5 model with 60 million parameters. It uses the getok tokenizer and expects input in Uchen script.
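As a quick sanity check (a minimal sketch; the sample line is reused from the Basic Usage example above), you can inspect how the tokenizer splits an Uchen-script line:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("billingsmoore/mlotsawa-ground-small")

# Uchen-script input, as expected by the model.
print(tokenizer.tokenize("ཁྱེད་ལ་བསྟོད་ཅིང་གསོལ་བ་བཏབ་པའི་མཐུས༔"))
```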
📄 License
This model is released under the MIT license.
Model Card Authors
billingsmoore
Model Card Contact
billingsmoore[at]gmail[dot]com