Khmer-mt5-summarization-1024tk-V2: An Open-Source Khmer Text Summarization Model - Free Generation of Article Paragraph Summaries

Khmer Mt5 Summarization 1024tk V2

Developed by songhieng

An improved Khmer text summarization model based on mT5-small, supporting inputs of up to 1024 tokens, suitable for summarizing Khmer articles, paragraphs, or documents.

Text Generation

Transformers

Open Source License: Apache-2.0
Tags: Khmer Text Summarization, Long Text Support, mT5 Fine-tuning

Downloads 16

Release Date: 2/16/2025

Model Overview

This model is a fine-tuned version of google/mt5-small for Khmer text summarization. Its training data adds the kimleang123/rfi_news dataset to the previously used dataset, with the aim of improving summarization performance.

Model Features

Long Text Support

Supports Khmer text inputs of up to 1024 tokens, making it suitable for processing longer documents.

Enhanced Dataset

Incorporates the kimleang123/rfi_news dataset alongside the original dataset, improving summarization quality.

Efficient Inference

Built on mT5-small, the smallest mT5 checkpoint (roughly 300 million parameters), so inference is relatively cheap compared with the larger mT5 variants.

Model Capabilities

Khmer Text Summarization

Long Text Processing

Use Cases

News Summarization

Khmer News Auto-Summarization

Automatically generates summaries for Khmer news articles, extracting key information.

Document Processing

Khmer Document Summarization

Automatically summarizes long Khmer documents to aid in quick content comprehension.

🚀 Khmer mT5 Summarization Model (1024 Tokens) - V2

This repository houses an enhanced version of the Khmer mT5 summarization model, songhieng/khmer-mt5-summarization-1024tk-V2. Trained on an extended dataset, including data from kimleang123/rfi_news, it offers improved summarization performance for Khmer text.

✨ Features

Enhanced Performance: Trained on a larger dataset for better summarization of Khmer text.
Extended Input Length: Can handle up to 1024 tokens of Khmer text.
Multiple Usage Modes: Can be used via Python code, Hugging Face Pipeline, or deployed as an API.

📦 Installation

1️⃣ Install Dependencies

Make sure you have transformers, torch, and datasets installed:

pip install transformers torch datasets

2️⃣ Load the Model

To load and use the fine-tuned model:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization-1024tk-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

💻 Usage Examples

Basic Usage

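def summarize_khmer(text, max_length=150):
    """Return a summary of `text`; `max_length` caps the summary length in tokens."""
    # The "summarize: " prefix marks the task for the model.
    input_text = f"summarize: {text}"
    # Inputs longer than 1024 tokens are truncated rather than rejected.
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=1024)
    # Beam search with 5 beams; a positive length_penalty favors longer summaries.
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary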

khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarize_khmer(khmer_text)
print("Khmer Summary:", summary)
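Because the tokenizer call above uses truncation=True with max_length=1024, anything beyond the first 1024 tokens of a document is silently dropped. One possible workaround, sketched below, is to split the token ids into overlapping windows, summarize each window, and join the partial summaries. This is an illustration and not part of the original model card: the names split_into_windows and summarize_long_text, the default budget of 1000 tokens (chosen to leave room for the "summarize: " prefix and the end-of-sequence token), and the overlap of 50 tokens are all assumptions.

```python
from typing import Callable, List


def split_into_windows(token_ids: List[int], max_tokens: int = 1000,
                       overlap: int = 50) -> List[List[int]]:
    """Split token ids into consecutive windows of at most `max_tokens` ids.

    Neighbouring windows share `overlap` ids, so text cut at a boundary is
    seen again with some of its context at the start of the next window.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # this window already reaches the end of the input
    return windows


def summarize_long_text(text: str,
                        encode: Callable[[str], List[int]],
                        decode: Callable[[List[int]], str],
                        summarize_fn: Callable[[str], str],
                        max_tokens: int = 1000,
                        overlap: int = 50) -> str:
    """Summarize each window separately and join the partial summaries."""
    windows = split_into_windows(encode(text), max_tokens, overlap)
    return " ".join(summarize_fn(decode(window)) for window in windows)
```

With the tokenizer and function defined earlier, `encode` could be `lambda t: tokenizer.encode(t, add_special_tokens=False)`, `decode` could be `tokenizer.decode`, and `summarize_fn` could be `summarize_khmer`. Decoding a window and tokenizing it again may not reproduce exactly the same token count, which is another reason to keep the budget below 1024. The joined result is a sequence of partial summaries, not a single coherent abstract.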

Advanced Usage

Using Hugging Face Pipeline

from transformers import pipeline

summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization-1024tk-V2")
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
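# Prepend the "summarize: " prefix used in Basic Usage, in case the model config does not add it automatically.
summary = summarizer("summarize: " + khmer_text, max_length=150, min_length=30, do_sample=False)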
print("Khmer Summary:", summary[0]['summary_text'])

Deploy as an API using FastAPI

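from fastapi import FastAPI
from pydantic import BaseModel

# Assumes `tokenizer` and `model` are already loaded as shown in "Load the Model".
app = FastAPI()

class SummarizeRequest(BaseModel):
    text: str  # Khmer text to summarize, sent in the JSON request body

@app.post("/summarize/")
def summarize(request: SummarizeRequest):
    # A plain `text: str` parameter would be read from the query string, which is a poor fit for long articles.
    inputs = tokenizer(f"summarize: {request.text}", return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Run with: uvicorn filename:app --reload
# Example JSON request body: {"text": "<Khmer text>"}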

📚 Documentation

Model Evaluation

The model was evaluated using ROUGE scores, which measure the similarity between the generated summaries and the reference summaries.

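import numpy as np
import evaluate  # `datasets.load_metric` is deprecated in favor of the `evaluate` library (pip install evaluate rouge_score)

rouge = evaluate.load("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    # If label padding was marked with -100 (the usual Transformers convention), restore the pad token before decoding.
    labels_ids = np.where(labels_ids != -100, labels_ids, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# `trainer` is the Trainer instance used for fine-tuning (not shown here), created with compute_metrics=compute_metrics.
trainer.evaluate()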
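To make concrete what a ROUGE score measures, here is a simplified ROUGE-1 F1 in plain Python: the harmonic mean of unigram precision and recall between a generated summary and its reference. It is a teaching sketch, not the implementation used to evaluate this model. The function name rouge1_f1 and the pluggable `tokenize` argument are assumptions introduced for the example.

```python
from collections import Counter
from typing import Callable, List


def rouge1_f1(prediction: str, reference: str,
              tokenize: Callable[[str], List[str]] = str.split) -> float:
    """Simplified ROUGE-1: F1 over unigrams shared by prediction and reference."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())  # share of predicted tokens that match
    recall = overlap / sum(ref_counts.values())      # share of reference tokens that are covered
    return 2 * precision * recall / (precision + recall)
```

Khmer is normally written without spaces between words, so the default whitespace split is only meaningful on pre-segmented text. Passing `tokenize=list` compares individual characters instead, which is crude but runs on raw Khmer strings. A proper evaluation would use a Khmer-aware word segmenter.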

Saving & Uploading the Model

After fine-tuning, the model can be uploaded to the Hugging Face Hub (this requires being logged in with a token that has write access):

model.push_to_hub("songhieng/khmer-mt5-summarization-1024tk-V2")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization-1024tk-V2")

To download it later:

model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization-1024tk-V2")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization-1024tk-V2")

📄 License

This project is licensed under the Apache-2.0 license.

Summary

Property	Details
Base Model	`google/mt5-small`
Task	Summarization
Language	Khmer (ខ្មែរ)
Training Data	`kimleang123/rfi_news` + previous dataset
Framework	Hugging Face Transformers
Evaluation Metric	ROUGE Score
Deployment	Hugging Face Model Hub, API (FastAPI), Python Code

Contributing

Contributions are welcome! Feel free to open issues or submit pull requests if you have any improvements or suggestions.

Contact

If you have any questions, feel free to reach out via Hugging Face Discussions or create an issue in the repository.

Built for the Khmer NLP Community
