Khmer mT5 Summarization Model
This repository houses a fine-tuned mT5 model for Khmer text summarization. Based on Google's [mT5-small](https://huggingface.co/google/mt5-small), the model was fine-tuned on a dataset of Khmer texts and their corresponding summaries. Fine-tuning used the Hugging Face `Trainer` API, optimizing the model to generate concise, meaningful summaries of Khmer text.
Quick Start
Features
- Base Model: `google/mt5-small`
- Fine-tuned for: Khmer text summarization
- Training Dataset: `kimleang123/khmer-text-dataset`
- Framework: Hugging Face `transformers`
- Task Type: Sequence-to-Sequence (Seq2Seq)
- Input: Khmer text (articles, paragraphs, or documents)
- Output: Summarized Khmer text
- Training Hardware: GPU (Tesla T4)
- Evaluation Metric: ROUGE Score
Installation
1. Install Dependencies
Ensure you have `transformers`, `torch`, and `datasets` installed:
pip install transformers torch datasets
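To confirm the install worked before downloading any model weights, a small check like the following may help. It is only a sketch using the standard library; the helper name `missing_packages` is hypothetical and not part of this repository.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed;
    # it does not import the package, so this stays fast even for torch.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["transformers", "torch", "datasets"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies found.")
```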
2. Load the Model
To load and use the fine-tuned model:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hub repository ID of the fine-tuned checkpoint
model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
Usage Examples
1. Using Python Code
def summarize_khmer(text, max_length=150):
    # Prefix the input with the task string before tokenizing
    input_text = f"summarize: {text}"
    # Inputs longer than 512 tokens are truncated
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    # Beam search with 5 beams; max_length caps the length of the generated summary
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarize_khmer(khmer_text)
print("Khmer Summary:", summary)
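Because the tokenizer call truncates input at 512 tokens, text beyond that limit never reaches the model. One possible workaround, sketched here under assumptions, is to split a long document at the Khmer sentence terminator ។ (U+17D4), summarize each chunk, and join the partial summaries. The helper names and the `max_chars` budget are hypothetical, and a character count is only a rough stand-in for the real token count.

```python
import re
from typing import Callable, List

KHAN = "\u17d4"  # Khmer sentence terminator (KHMER SIGN KHAN)


def split_sentences(text: str) -> List[str]:
    """Split text after each KHAN, keeping the terminator with its sentence."""
    parts = re.findall(rf"[^{KHAN}]+{KHAN}?", text)
    return [p for p in parts if p.strip()]


def chunk_khmer_text(text: str, max_chars: int = 1000) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk;
    the tokenizer would still truncate it.
    """
    chunks: List[str] = []
    current = ""
    for sentence in split_sentences(text):
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


def summarize_long(text: str, summarize_fn: Callable[[str], str], max_chars: int = 1000) -> str:
    """Summarize each chunk with summarize_fn and join the results with spaces."""
    return " ".join(summarize_fn(chunk) for chunk in chunk_khmer_text(text, max_chars))
```

With the function defined earlier, a call would look like `summarize_long(long_text, summarize_khmer)`. Joining per-chunk summaries can read less smoothly than a single summary, so treat this as a starting point rather than a recommended pipeline.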
2. Using Hugging Face Pipeline
For a simpler approach:
from transformers import pipeline

# The pipeline wraps tokenization, generation, and decoding in one call
summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization")
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("Khmer Summary:", summary[0]['summary_text'])
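3. Deploy as an API using FastAPI
You can create a simple API for summarization:
from fastapi import FastAPI
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the model once at startup rather than on every request
model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

app = FastAPI()

@app.post("/summarize/")
def summarize(text: str):
    # A plain `str` parameter is read from the query string: POST /summarize/?text=...
    inputs = tokenizer(f"summarize: {text}", return_tensors="pt", truncation=True, max_length=512)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}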
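Since the endpoint declares `text` as a plain `str` argument, FastAPI reads it from the query string rather than from a JSON body. A client call could therefore look like the sketch below, which uses only the standard library. The base URL `http://127.0.0.1:8000` is an assumption (a local development server), and both helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_summarize_request(text, base_url="http://127.0.0.1:8000"):
    """Build a POST request that carries the text as a URL-encoded query parameter."""
    query = urllib.parse.urlencode({"text": text})
    return urllib.request.Request(f"{base_url}/summarize/?{query}", method="POST")


def request_summary(text, base_url="http://127.0.0.1:8000"):
    """Send the request and return the "summary" field of the JSON response."""
    with urllib.request.urlopen(build_summarize_request(text, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))["summary"]
```

Long articles may run into URL length limits when sent as a query parameter, so a request-body model is probably a better fit for production use.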
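Documentation
Model Evaluation
The model was evaluated using ROUGE scores, which measure the n-gram overlap between the generated summaries and the reference (ground-truth) summaries.
import evaluate  # replaces datasets.load_metric, which recent datasets releases no longer provide
import numpy as np

rouge = evaluate.load("rouge")  # requires: pip install evaluate rouge_score

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    # If labels were padded with -100 (the usual collator default), map them back to the pad token before decoding
    labels_ids = np.where(labels_ids != -100, labels_ids, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# `trainer` is the Trainer instance from fine-tuning, created with compute_metrics=compute_metrics
trainer.evaluate()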
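To make the metric concrete, here is a minimal sketch of ROUGE-N computed by hand. It is an illustration only, not the implementation used to evaluate this model, and the function name `rouge_n` is hypothetical. It takes pre-tokenized lists because Khmer is written without spaces between words, so the choice of tokenizer strongly affects the score.

```python
from collections import Counter


def rouge_n(pred_tokens, ref_tokens, n=1):
    """Return precision, recall, and F1 of n-gram overlap between two token lists."""

    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(pred_tokens), ngrams(ref_tokens)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Two of the three predicted tokens appear in the four-token reference
scores = rouge_n(["a", "b", "c"], ["a", "b", "d", "e"])
print(scores)
```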
Saving & Uploading the Model
After fine-tuning, the model was uploaded to the Hugging Face Hub:
model.push_to_hub("songhieng/khmer-mt5-summarization")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization")
To download it later:
model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization")
Technical Details
The model is a fine-tuned version of Google's mT5-small for Khmer text summarization, trained with the Hugging Face `Trainer` API on a GPU (Tesla T4). It takes Khmer text as input, produces a Khmer summary as output, and was evaluated with ROUGE scores.
License
The model is licensed under the MIT license.
Contributing
Contributions are welcome! Feel free to open an issue or submit a pull request if you have improvements to suggest.
Contact
If you have any questions, reach out via Hugging Face Discussions or create an issue in the repository.
Built for the Khmer NLP Community
Summary Table
| Property | Details |
|---|---|
| Model Type | google/mt5-small |
| Task | Summarization |
| Language | Khmer (ខ្មែរ) |
| Training Data | kimleang123/khmer-text-dataset |
| Framework | Hugging Face Transformers |
| Evaluation Metric | ROUGE Score |
| Deployment | Hugging Face Model Hub, API (FastAPI), Python Code |