khmer-mt5-summarization open-source model: generate concise, semantically rich Khmer text summaries for free

Khmer Mt5 Summarization

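Developed by songhieng

An mT5 model fine-tuned for Khmer text summarization. It is based on Google's mT5-small and was fine-tuned on a Khmer text dataset to generate concise, semantically rich summaries of Khmer text.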

Text generation

Transformers

License: MIT #Khmer text summarization #mT5 fine-tuning #Multilingual support

Downloads: 58

Released: 2/11/2025

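Model Overview

The model is built specifically for automatic summarization of Khmer text and is suited to summarizing articles, paragraphs, or documents.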

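Model Features

Khmer-optimized

Fine-tuned specifically on Khmer text to improve summary quality

Lightweight model

Built on the mT5-small architecture, which reduces compute requirements while preserving summarization performance

Variable-length summaries

Summary length can be adjusted through generation parameters such as max_length and min_length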

Model Capabilities

Khmer text understanding

Automatic summary generation

Long-text compression

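Use Cases

News media

News article summarization

Automatically generates short summaries of Khmer news articles

Helps readers quickly grasp the key points of a story

Education and research

Academic paper summarization

Generates structured summaries of Khmer academic papers

Makes research literature faster to search and read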

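🚀 Khmer mT5 Text Summarization Model

This project is an mT5 model fine-tuned for Khmer text summarization. It is based on Google's mT5-small and was fine-tuned with the Hugging Face Trainer API on a dataset of Khmer texts paired with their summaries. The resulting model generates concise, meaningful summaries of Khmer text.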

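🚀 Quick Start

✨ Key Features

Base model: google/mt5-small.
Fine-tuning task: Khmer text summarization.
Training dataset: kimleang123/khmer-text-dataset.
Framework: Hugging Face transformers.
Task type: sequence-to-sequence (Seq2Seq).
Input: Khmer text (an article, paragraph, or document).
Output: a Khmer summary.
Training hardware: GPU (Tesla T4).
Evaluation metric: ROUGE scores.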
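
The card says the model was fine-tuned with the Trainer API on text/summary pairs, but it does not show the preprocessing step. The following is a minimal sketch of what such a step could look like, not the author's actual training code. The column names `text` and `summary` are hypothetical (check the dataset's real schema), and the length limits of 512 and 150 are borrowed from the inference examples in this card.

```python
def preprocess(batch, tokenizer, max_input_length=512, max_target_length=150,
               text_column="text", summary_column="summary"):
    """Tokenize a batch of articles and reference summaries for Seq2Seq training.

    text_column and summary_column are assumed names, not confirmed by the card.
    """
    # Prepend the same task prefix that the inference examples use.
    inputs = ["summarize: " + t for t in batch[text_column]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # text_target tells the tokenizer to encode the summaries as decoder labels.
    labels = tokenizer(text_target=batch[summary_column],
                       max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```

With a loaded dataset this would typically be applied as `dataset.map(lambda b: preprocess(b, tokenizer), batched=True)` before handing the result to a trainer.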

📦 Installation

1️⃣ Install dependencies

Make sure transformers, torch, and datasets are installed:

pip install transformers torch datasets
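
To confirm the install succeeded before loading the model, a small check like the one below can help. It is only a sketch: `missing_packages` is a hypothetical helper name, and the default package list simply mirrors the pip command above.

```python
import importlib.util


def missing_packages(names=("transformers", "torch", "datasets")):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are installed.")
```

mT5 tokenizers are SentencePiece-based, so depending on the tokenizer files shipped with the checkpoint, `sentencepiece` may also be required; treat that as an assumption to verify.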

2️⃣ Load the model

Load the fine-tuned model and its tokenizer:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

💻 Usage Examples

Basic usage

def summarize_khmer(text, max_length=150):
    # Prepend the T5-style task prefix
    input_text = f"summarize: {text}"
    # Tokenize; inputs longer than 512 tokens are truncated
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    # Beam search with 5 beams; max_length caps the summary length in tokens
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarize_khmer(khmer_text)
print("🔹 Khmer summary:", summary)
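
Because `summarize_khmer` truncates its input at 512 tokens, anything beyond that point in a long document is ignored. One possible workaround, sketched here under stated assumptions, is to split the text on the Khmer full stop (។), group sentences into chunks, and summarize each chunk separately. The helper names are hypothetical, and `max_chars=1000` is an arbitrary character budget, not a token-accurate limit.

```python
def split_khmer_sentences(text):
    """Split on the Khmer full stop (។, U+17D4), keeping it attached to each sentence."""
    parts = text.split("។")
    sentences = [p.strip() + "។" for p in parts[:-1] if p.strip()]
    tail = parts[-1].strip()
    if tail:  # trailing text without a closing full stop
        sentences.append(tail)
    return sentences


def chunk_sentences(sentences, max_chars=1000):
    """Greedily pack sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars still becomes its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks


def summarize_long(text, summarize_fn, max_chars=1000):
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    chunks = chunk_sentences(split_khmer_sentences(text), max_chars)
    return " ".join(summarize_fn(chunk) for chunk in chunks)
```

Usage would look like `summarize_long(long_text, summarize_khmer)`. Joining partial summaries is the simplest option; the joined result could itself be passed through `summarize_khmer` once more if a shorter output is needed.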

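Advanced usage

For a simpler interface, use the Hugging Face pipeline:

from transformers import pipeline

# The pipeline bundles tokenization, generation, and decoding
summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization")
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
# do_sample=False keeps decoding deterministic; min_length and max_length bound the summary length in tokens
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("🔹 Khmer summary:", summary[0]['summary_text'])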

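Deploy as an API

Deploy the model as an API with FastAPI:

from fastapi import FastAPI
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the model once at startup so every request reuses it
model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

app = FastAPI()

@app.post("/summarize/")
def summarize(text: str):  # FastAPI reads `text` from the query string
    inputs = tokenizer(f"summarize: {text}", return_tensors="pt", truncation=True, max_length=512)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Run with: uvicorn filename:app --reload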
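
Since the endpoint declares `text: str` as a plain function parameter, FastAPI treats it as a query parameter rather than a JSON body. A client therefore has to URL-encode the Khmer text into the query string. The sketch below uses only the standard library; the helper names are hypothetical, and the base URL assumes uvicorn's default host and port.

```python
import json
import urllib.parse
import urllib.request


def build_summarize_request(text, base_url="http://127.0.0.1:8000"):
    """Build a POST request that carries `text` as a URL-encoded query parameter."""
    query = urllib.parse.urlencode({"text": text})
    return urllib.request.Request(f"{base_url}/summarize/?{query}", method="POST")


def request_summary(text, base_url="http://127.0.0.1:8000"):
    """Send the request to a running server and return the "summary" field."""
    with urllib.request.urlopen(build_summarize_request(text, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))["summary"]
```

Query strings are a poor fit for long documents, so a production service would more likely accept the text in a JSON request body; that would require changing the endpoint signature as well.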

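🔧 Technical Details

The model is evaluated with ROUGE scores, which measure the overlap between generated summaries and reference summaries:

import numpy as np
import evaluate  # requires: pip install evaluate rouge_score

rouge = evaluate.load("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    # Assumes generation-based evaluation, so predictions are token ids
    pred_ids = pred.predictions
    # Labels are padded with -100; restore the pad token id before decoding
    labels_ids = np.where(labels_ids != -100, labels_ids, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# `trainer` is the trainer instance from fine-tuning, created with compute_metrics=compute_metrics
trainer.evaluate()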
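
To make the metric concrete, here is a dependency-free sketch of ROUGE-1 F1, which scores unigram overlap between a prediction and a reference. Khmer is written without spaces between words, so the choice of tokenizer matters a great deal; this example compares individual code points, a crude assumption made only to keep the sketch self-contained. `rouge1_f1` and its `tokenize` parameter are hypothetical names, not part of any library, and the scores will not match those of a full ROUGE implementation.

```python
from collections import Counter


def rouge1_f1(prediction, reference, tokenize=list):
    """Unigram-overlap F1 between prediction and reference.

    tokenize defaults to splitting into code points (an illustrative
    assumption); pass a real Khmer word segmenter for meaningful scores.
    """
    pred_tokens = [t for t in tokenize(prediction) if not t.isspace()]
    ref_tokens = [t for t in tokenize(reference) if not t.isspace()]
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge1_f1("ab", "abcd")` has precision 1.0 and recall 0.5, giving an F1 of 2/3.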

💾 Saving and Uploading the Model

After fine-tuning, upload the model to the Hugging Face Hub:

model.push_to_hub("songhieng/khmer-mt5-summarization")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization")

To download the model later:

model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization")

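📚 Detailed Documentation

Property	Details
Base model	`google/mt5-small`
Task type	Text summarization
Language	Khmer (ខ្មែរ)
Training data	`kimleang123/khmer-text-dataset`
Framework	Hugging Face Transformers
Evaluation metric	ROUGE scores
Deployment	Hugging Face Hub, API (FastAPI), Python code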