🚀 pszemraj/pegasus-x-large-book-summary
This model is based on the Transformer architecture and is designed for long-document summarization. It handles long input sequences effectively, achieves strong ROUGE scores on several datasets, and can summarize a wide range of texts, such as earthquake-related passages, academic papers, and lecture transcripts.
🚀 Quick Start
The model can be used to summarize many kinds of text, such as earthquake-related passages, academic papers, and lecture transcripts. Example usage:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("pszemraj/pegasus-x-large-book-summary")
model = AutoModelForSeq2SeqLM.from_pretrained("pszemraj/pegasus-x-large-book-summary")

text = "large earthquakes along a given fault segment do not occur at random intervals because it takes time to accumulate the strain energy for the rupture. The rates at which tectonic plates move and accumulate strain at their boundaries are approximately uniform. Therefore, in first approximation, one may expect that large ruptures of the same fault segment will occur at approximately constant time intervals. If subsequent main shocks have different amounts of slip across the fault, then the recurrence time may vary, and the basic idea of periodic mainshocks must be modified. For great plate boundary ruptures the length and slip often vary by a factor of 2. Along the southern segment of the San Andreas fault the recurrence interval is 145 years with variations of several decades. The smaller the standard deviation of the average recurrence interval, the more specific could be the long term prediction of a future mainshock."

# Tokenize the input and generate a summary with beam search
inputs = tokenizer(text, return_tensors="pt")
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=2,
    early_stopping=True,
    length_penalty=0.1,
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```
✨ Key Features
- Long-sequence handling: processes long input sequences (e.g., up to 4096 tokens) at a lower computational cost than conventional Transformer-based models.
- Efficient attention: replaces full attention with block sparse attention, substantially reducing computational complexity.
- Multi-task applicability: suitable for a variety of NLP tasks, particularly long-document summarization.
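Even with long-context support, book-length inputs can exceed the model's input window. A common pattern is to split the document into overlapping chunks, summarize each chunk, and then optionally summarize the concatenated summaries. A minimal word-based sketch of that chunking step (the helper `chunk_text` and its parameters are illustrative, not part of this model's API):

```python
def chunk_text(text: str, max_words: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks for chunk-wise summarization.

    Note: real tokenizers count subword tokens, not words, so max_words should
    be chosen conservatively relative to the model's token limit.
    """
    words = text.split()
    step = max_words - overlap  # advance by chunk size minus overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i : i + max_words]))
        if i + max_words >= len(words):  # last chunk reached the end
            break
    return chunks
```

Each chunk can then be passed through the quick-start pipeline above, and the per-chunk summaries joined into a final input for a second summarization pass.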
📚 Documentation
Model parameters
| Property | Value |
|---|---|
| Max length | 48 |
| Min length | 2 |
| No-repeat n-gram size | 3 |
| Encoder no-repeat n-gram size | 3 |
| Early stopping | Enabled |
| Length penalty | 0.1 |
| Number of beams | 2 |
| Base model | google/pegasus-x-large |
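The parameters above map directly onto `model.generate` keyword arguments in 🤗 Transformers. A small sketch collecting them in one place (the helper name `book_summary_generate_kwargs` is illustrative, not part of the model's API):

```python
def book_summary_generate_kwargs() -> dict:
    """Generation settings mirroring the model-parameter table above."""
    return {
        "max_length": 48,
        "min_length": 2,
        "no_repeat_ngram_size": 3,
        "encoder_no_repeat_ngram_size": 3,  # avoid copying long input n-grams verbatim
        "early_stopping": True,
        "length_penalty": 0.1,
        "num_beams": 2,
    }
```

With the quick-start objects loaded, this would be used as `model.generate(inputs["input_ids"], **book_summary_generate_kwargs())`.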
Datasets and metrics
The model was evaluated on several datasets. Selected results:
samsum dataset

| Metric | Value |
|---|---|
| ROUGE-1 | 33.1401 |
| ROUGE-2 | 9.3095 |
| ROUGE-L | 24.8552 |
| ROUGE-LSUM | 29.0391 |
| loss | 2.288182497024536 |
| gen_len | 45.2173 |
launch/gov_report dataset

| Metric | Value |
|---|---|
| ROUGE-1 | 39.7279 |
| ROUGE-2 | 10.8944 |
| ROUGE-L | 19.7018 |
| ROUGE-LSUM | 36.5634 |
| loss | 2.473011016845703 |
| gen_len | 212.8243 |
billsum dataset

| Metric | Value |
|---|---|
| ROUGE-1 | 42.1065 |
| ROUGE-2 | 15.4079 |
| ROUGE-L | 24.8814 |
| ROUGE-LSUM | 36.0375 |
| loss | 1.9130958318710327 |
| gen_len | 179.2184 |
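The ROUGE-1 scores in the tables above measure unigram overlap between a generated summary and a reference (reported scaled by 100). A toy re-implementation of ROUGE-1 F1 to make the metric concrete; for real evaluation, an established library such as `rouge-score` is typically used:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 and ROUGE-L follow the same precision/recall/F1 pattern but count bigram matches and longest-common-subsequence matches, respectively.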
📄 License
This model is released under the following licenses:
- Apache-2.0
- BSD-3-Clause