Khmer-mt5-summarization Open-source Model - Free Generation of Concise and Semantically Rich Khmer Text Summaries

Khmer Mt5 Summarization

Developed by songhieng

This is an mT5 model fine-tuned for Khmer text summarization tasks, based on Google's mT5-small model. It was fine-tuned on a Khmer text dataset and can generate concise and semantically rich Khmer text summaries.

Text Generation

Transformers

License: MIT
Tags: Khmer Text Summarization, mT5 Fine-tuning, Multilingual Support

Downloads 58

Release Time : 2/11/2025

Model Overview

This model is specifically designed for automatic summarization of Khmer texts, suitable for summarizing articles, paragraphs, or documents.

Model Features

Khmer Optimization

Specially fine-tuned for Khmer text, optimizing summarization quality

Lightweight Model

Based on the mT5-small architecture, the smallest mT5 variant, which keeps computational resource requirements low

Multi-length Summaries

Supports generating summaries of different lengths through parameter adjustments
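As a rough sketch of what "parameter adjustments" could look like: the usage examples in this document control length through the `max_length` and `min_length` generation arguments. The preset names and token limits below are illustrative assumptions, not values published by the model author.

```python
# Hypothetical length presets; the names and numbers are assumptions for illustration.
LENGTH_PRESETS = {
    "short": {"min_length": 10, "max_length": 50},
    "medium": {"min_length": 30, "max_length": 150},
    "long": {"min_length": 60, "max_length": 300},
}


def generation_kwargs(preset="medium", num_beams=5):
    """Return keyword arguments intended to be passed to model.generate()."""
    if preset not in LENGTH_PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    kwargs = dict(LENGTH_PRESETS[preset])
    # Beam-search settings copied from the usage examples in this document.
    kwargs.update(num_beams=num_beams, length_penalty=2.0, early_stopping=True)
    return kwargs


print(generation_kwargs("short"))
```

The returned dictionary can be unpacked into a generation call, for example `model.generate(**inputs, **generation_kwargs("short"))`, once the model and tokenizer are loaded.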

Model Capabilities

Khmer Text Understanding

Automatic Summarization

Long Text Compression

Use Cases

News Media

News Article Summarization

Automatically generates concise summaries of Khmer news articles

Helps readers quickly grasp key points of the news

Education & Research

Academic Paper Summarization

Generates structured summaries for Khmer academic papers

Improves efficiency in searching and reading research literature

🚀 Khmer mT5 Summarization Model

This repository houses a fine-tuned mT5 model for Khmer text summarization. Based on Google's [mT5-small](https://huggingface.co/google/mt5-small), the model has been fine-tuned on a dataset of Khmer texts and their corresponding summaries. Fine-tuning was carried out using the Hugging Face Trainer API, optimizing the model to generate concise and meaningful summaries of Khmer text.

🚀 Quick Start

✨ Features

Base Model: google/mt5-small
Fine-tuned for: Khmer text summarization
Training Dataset: kimleang123/khmer-text-dataset
Framework: Hugging Face transformers
Task Type: Sequence-to-Sequence (Seq2Seq)
Input: Khmer text (articles, paragraphs, or documents)
Output: Summarized Khmer text
Training Hardware: GPU (Tesla T4)
Evaluation Metric: ROUGE Score

📦 Installation

1️⃣ Install Dependencies

Ensure you have transformers, torch, and datasets installed:

pip install transformers torch datasets

2️⃣ Load the Model

To load and use the fine-tuned model:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

💻 Usage Examples

1️⃣ Using Python Code

def summarize_khmer(text, max_length=150):
    input_text = f"summarize: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarize_khmer(khmer_text)
print("🔹 Khmer Summary:", summary)
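Because the tokenizer call above truncates input at 512 tokens, anything past that point in a long document is dropped before summarization. One possible workaround, sketched here under assumptions, is to split the text on the Khmer full stop (។) and group sentences into chunks that are summarized separately. The helper names and the character budget (a crude stand-in for a token count) are hypothetical and not part of the model repository.

```python
import re


def split_khmer_sentences(text):
    """Split on the Khmer full stop (។), keeping the mark with its sentence."""
    parts = re.findall(r"[^។]+។?", text)
    return [part.strip() for part in parts if part.strip()]


def chunk_text(text, max_chars=1000):
    """Group whole sentences into chunks of at most max_chars characters.

    max_chars is an assumed proxy for the 512-token limit. A single sentence
    longer than max_chars is kept whole as its own oversized chunk.
    """
    chunks, current = [], ""
    for sentence in split_khmer_sentences(text):
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("ក។ ខ។ គ។", max_chars=5))
```

Each chunk could then be passed to `summarize_khmer`, for example `[summarize_khmer(chunk) for chunk in chunk_text(long_text)]`, and the partial summaries joined afterwards.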

2️⃣ Using Hugging Face Pipeline

For a simpler approach:

from transformers import pipeline

summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization")
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("🔹 Khmer Summary:", summary[0]['summary_text'])

3️⃣ Deploy as an API using FastAPI

You can create a simple API for summarization:

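from fastapi import FastAPI
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the model once at startup so every request reuses it.
model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

app = FastAPI()

@app.post("/summarize/")
def summarize(text: str):  # a bare `str` parameter is read from the query string
    inputs = tokenizer(f"summarize: {text}", return_tensors="pt", truncation=True, max_length=512)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Run with: uvicorn filename:app --reload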
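Since the endpoint declares `text` as a plain `str`, FastAPI reads it from the query string rather than from a JSON body. A minimal client sketch using only the standard library might look like the following. The helper name is hypothetical, and the base URL assumes uvicorn's default host and port.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_summarize_request(text, base_url="http://127.0.0.1:8000"):
    """Build a POST request for /summarize/ with the text as a query parameter."""
    # urlencode percent-encodes the Khmer text as UTF-8.
    query = urlencode({"text": text})
    return Request(f"{base_url}/summarize/?{query}", method="POST")


request = build_summarize_request("កម្ពុជា")
print(request.get_method(), request.full_url)

# To actually send it (requires the API server to be running):
# import json
# from urllib.request import urlopen
# with urlopen(request) as response:
#     print(json.load(response)["summary"])
```

For long articles a request body is usually a better fit than a query string, but that would require changing the endpoint signature shown above.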

📚 Documentation

📊 Model Evaluation

The model was evaluated using ROUGE scores, which measure how similar the generated summaries are to the ground truth summaries.

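import numpy as np
import evaluate  # pip install evaluate rouge_score

# datasets.load_metric has been removed from recent datasets releases;
# evaluate.load is its replacement.
rouge = evaluate.load("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    # Padded label positions are stored as -100 (the DataCollatorForSeq2Seq default);
    # restore the pad token so they can be decoded.
    labels_ids = np.where(labels_ids != -100, labels_ids, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# Assumes `trainer` was created with compute_metrics=compute_metrics
# and predict_with_generate=True.
trainer.evaluate()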
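To make the metric more concrete, ROUGE-1 counts the unigrams shared between a generated summary and its reference, then reports precision, recall, and F1. The sketch below is a simplified pure-Python illustration, not the implementation behind the `rouge` metric loaded above. It tokenizes on whitespace, which is an assumption that suits the English toy strings here but not Khmer, where words are not separated by spaces.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Simplified ROUGE-1 over whitespace tokens (illustrative only)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # Counter intersection keeps, per token, the smaller of the two counts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat", "the cat sat on the mat")
print(scores)
```

In this toy case every predicted token appears in the reference (precision 1.0), but only half of the reference tokens are covered (recall 0.5).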

💾 Saving & Uploading the Model

After fine-tuning, the model was uploaded to Hugging Face Hub:

model.push_to_hub("songhieng/khmer-mt5-summarization")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization")

To download it later:

model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization")

🔧 Technical Details

The model is a fine-tuned version of Google's mT5-small for Khmer text summarization. Fine-tuning used the Hugging Face Trainer API on a single GPU (Tesla T4). The model takes Khmer text as input and produces a Khmer summary as output, and it was evaluated with ROUGE scores.

📄 License

The model is licensed under the MIT license.

🤝 Contributing

Contributions are welcome! Feel free to open issues or submit pull requests if you have improvements to suggest.

📬 Contact

If you have any questions, feel free to reach out via Hugging Face Discussions or create an issue in the repository.

📌 Built for the Khmer NLP Community 🇰🇭 🚀

Summary Table

| Property | Details |
|---|---|
| Model Type | `google/mt5-small` |
| Task | Summarization |
| Language | Khmer (ខ្មែរ) |
| Training Data | `kimleang123/khmer-text-dataset` |
| Framework | Hugging Face Transformers |
| Evaluation Metric | ROUGE Score |
| Deployment | Hugging Face Model Hub, API (FastAPI), Python Code |
