Khmer mT5 Summarization Model (1024 Tokens) - V2
This repository houses an enhanced version of the Khmer mT5 summarization model, songhieng/khmer-mt5-summarization-1024tk-V2. Trained on an extended dataset, including data from kimleang123/rfi_news, it offers improved summarization performance for Khmer text.
Features
- Enhanced Performance: Trained on a larger dataset for better summarization of Khmer text.
- Extended Input Length: Accepts up to 1024 input tokens of Khmer text; the usage examples below truncate longer inputs.
- Multiple Usage Modes: Can be used via Python code, Hugging Face Pipeline, or deployed as an API.
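Because inputs beyond 1024 tokens are truncated, longer articles need to be split before summarization. The following is a minimal sketch of one way to do that, not part of the released model: `chunk_by_token_budget` is a hypothetical helper, and `count_tokens` is any callable you supply that returns a token count for a string.

```python
def chunk_by_token_budget(sentences, count_tokens, budget=1024):
    """Group sentences into chunks whose summed token counts stay within `budget`.

    Hypothetical helper. A single sentence longer than `budget` is kept as its
    own chunk and would still be truncated by the tokenizer.
    """
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and used + n > budget:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Example with a whitespace-based counter standing in for a real tokenizer.
# With the model's tokenizer you could pass:
#   count_tokens=lambda s: len(tokenizer(s)["input_ids"])
demo_chunks = chunk_by_token_budget(
    ["a b", "c d", "e"], count_tokens=lambda s: len(s.split()), budget=4
)
print(demo_chunks)
```

In practice it may be safer to choose a budget somewhat below 1024, since the `summarize: ` prefix and special tokens also count toward the limit. Each chunk can then be summarized separately.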
Installation
1. Install Dependencies
Make sure you have transformers, torch, and datasets installed:
pip install transformers torch datasets
2. Load the Model
To load and use the fine-tuned model:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_name = "songhieng/khmer-mt5-summarization-1024tk-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
Usage Examples
Basic Usage
def summarize_khmer(text, max_length=150):
    # Prefix the input with the task prompt and truncate it to 1024 tokens
    input_text = f"summarize: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=1024)
    # Beam search with 5 beams; max_length caps the summary length in tokens
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarize_khmer(khmer_text)
print("Khmer Summary:", summary)
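The function above summarizes one text at a time. As a sketch of how several texts could be processed in one call, the hypothetical helper below takes the tokenizer and model as arguments and reuses the same generation settings. `summarize_batch` and its parameter names are assumptions for illustration, not part of the model's API.

```python
def summarize_batch(texts, tokenizer, model, max_input_tokens=1024, max_summary_tokens=150):
    """Summarize a list of texts in a single generate() call (hypothetical helper)."""
    prompts = [f"summarize: {text}" for text in texts]
    # padding=True pads every prompt in the batch to the longest one
    inputs = tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_input_tokens,
    )
    summary_ids = model.generate(
        **inputs,
        max_length=max_summary_tokens,
        num_beams=5,
        length_penalty=2.0,
        early_stopping=True,
    )
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
```

With the objects loaded earlier, a call would look like `summarize_batch([text_a, text_b], tokenizer, model)`. Large batches of long inputs can use a lot of memory, so a small batch size is a reasonable starting point.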
Advanced Usage
Using Hugging Face Pipeline
from transformers import pipeline
summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization-1024tk-V2")
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("Khmer Summary:", summary[0]['summary_text'])
Deploy as an API using FastAPI
from fastapi import FastAPI
# Reuses the `tokenizer` and `model` objects loaded in the "Load the Model" step
app = FastAPI()
@app.post("/summarize/")
def summarize(text: str):  # a plain `str` parameter is read from the query string
    inputs = tokenizer(f"summarize: {text}", return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}
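Since the endpoint declares `text` as a plain `str` parameter, FastAPI reads it from the query string rather than from a JSON body. The sketch below shows one way a client could call it using only the standard library. The base URL `http://localhost:8000` and both function names are assumptions for illustration; adjust them to wherever the app is actually served.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_summarize_request(text, base_url="http://localhost:8000"):
    """Build a POST request that passes `text` as a URL-encoded query parameter."""
    query = urlencode({"text": text})  # percent-encodes Khmer text as UTF-8
    return Request(f"{base_url}/summarize/?{query}", data=b"", method="POST")


def request_summary(text, base_url="http://localhost:8000"):
    """Send the request and return the `summary` field of the JSON response."""
    with urlopen(build_summarize_request(text, base_url)) as response:
        return json.load(response)["summary"]
```

For long articles a JSON request body would avoid URL length limits, but that would require changing the endpoint signature shown above.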
Documentation
Model Evaluation
The model was evaluated using ROUGE scores, which measure the similarity between the generated summaries and the reference summaries.
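import numpy as np
import evaluate  # pip install evaluate rouge_score; datasets.load_metric was removed in datasets 3.0
rouge = evaluate.load("rouge")
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    # Label positions masked with -100 cannot be decoded, so map them to the pad token first
    labels_ids = np.where(labels_ids != -100, labels_ids, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)
# `trainer` is assumed to be the Trainer instance from fine-tuning, created with compute_metrics=compute_metrics
trainer.evaluate()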
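To make the ROUGE idea concrete, here is a minimal, illustrative ROUGE-1 F1 computation: it counts the unigrams shared by a generated summary and its reference. This is a simplified sketch and not the scorer used to evaluate the model. The `rouge1_f1` function and its `tokenize` parameter are hypothetical. Whitespace splitting is only a default, and because Khmer is not whitespace-delimited, a Khmer-aware tokenizer (or character-level tokens via `tokenize=list`) is likely a better fit.

```python
from collections import Counter


def rouge1_f1(prediction, reference, tokenize=str.split):
    """Unigram-overlap F1 between a prediction and a reference (simplified ROUGE-1)."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: each token counts at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Two of three unigrams match in each direction
score = rouge1_f1("the cat sat", "the cat ran")
print(round(score, 3))
```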
Saving & Uploading the Model
After fine-tuning, the model can be uploaded to the Hugging Face Hub:
model.push_to_hub("songhieng/khmer-mt5-summarization-1024tk-V2")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization-1024tk-V2")
To download it later:
model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization-1024tk-V2")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization-1024tk-V2")
License
This project is licensed under the Apache-2.0 license.
Summary
| Property | Details |
|---|---|
| Model Type | google/mt5-small |
| Task | Summarization |
| Language | Khmer (ខ្មែរ) |
| Training Data | kimleang123/rfi_news + previous dataset |
| Framework | Hugging Face Transformers |
| Evaluation Metric | ROUGE Score |
| Deployment | Hugging Face Model Hub, API (FastAPI), Python Code |
Contributing
Contributions are welcome! Feel free to open issues or submit pull requests if you have any improvements or suggestions.
Contact
If you have any questions, feel free to reach out via Hugging Face Discussions or create an issue in the repository.
Built for the Khmer NLP Community