🚀 pszemraj/pegasus-x-large-book-summary
Built on the Transformer architecture and designed for long-document summarization, this model handles long input sequences effectively, achieves strong ROUGE scores on multiple datasets, and can generate summaries for a variety of texts, including earthquake-related reports, academic papers, and lecture transcripts.
🚀 Quick Start
The model can be used to summarize many kinds of text, such as earthquake-related passages, academic papers, and lecture transcripts. A usage example follows:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the fine-tuned summarization model
tokenizer = AutoTokenizer.from_pretrained("pszemraj/pegasus-x-large-book-summary")
model = AutoModelForSeq2SeqLM.from_pretrained("pszemraj/pegasus-x-large-book-summary")

# Example input: a passage on earthquake recurrence intervals
text = "large earthquakes along a given fault segment do not occur at random intervals because it takes time to accumulate the strain energy for the rupture. The rates at which tectonic plates move and accumulate strain at their boundaries are approximately uniform. Therefore, in first approximation, one may expect that large ruptures of the same fault segment will occur at approximately constant time intervals. If subsequent main shocks have different amounts of slip across the fault, then the recurrence time may vary, and the basic idea of periodic mainshocks must be modified. For great plate boundary ruptures the length and slip often vary by a factor of 2. Along the southern segment of the San Andreas fault the recurrence interval is 145 years with variations of several decades. The smaller the standard deviation of the average recurrence interval, the more specific could be the long term prediction of a future mainshock."

# Tokenize the input, generate a summary with beam search, and decode it
inputs = tokenizer(text, return_tensors="pt")
summary_ids = model.generate(inputs["input_ids"], num_beams=2, early_stopping=True, length_penalty=0.1)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```
✨ Key Features
- Long-sequence handling: processes long input sequences (e.g. up to 4096 tokens) at a lower computational cost on long inputs than conventional Transformer-based models; see the usage sketch after this list.
- Efficient attention: replaces full attention with block sparse attention, significantly reducing computational complexity.
- Multi-task applicability: suitable for a range of natural language processing tasks, especially long-document summarization.
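As a minimal sketch of how the long-input capability can be used in practice, the model can be wrapped in the standard `transformers` summarization pipeline. The file name `chapter.txt` is a placeholder, and the generation settings below simply mirror the parameter table in the next section; they are illustrative assumptions, not the only valid configuration:

```python
from transformers import pipeline

# Wrap the model in the standard summarization pipeline
summarizer = pipeline(
    "summarization",
    model="pszemraj/pegasus-x-large-book-summary",
)

# Hypothetical long document; any long string can be passed in
with open("chapter.txt", encoding="utf-8") as f:
    long_text = f.read()

result = summarizer(
    long_text,
    max_length=48,
    min_length=2,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    num_beams=2,
    length_penalty=0.1,
    early_stopping=True,
    truncation=True,  # inputs longer than the model's limit are truncated by the tokenizer
)
print(result[0]["summary_text"])
```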
📚 Documentation
Model Parameters
| Parameter | Value |
|-----------|-------|
| Max length | 48 |
| Min length | 2 |
| No-repeat n-gram size | 3 |
| Encoder no-repeat n-gram size | 3 |
| Early stopping | Enabled |
| Length penalty | 0.1 |
| Number of beams | 2 |
| Base model | google/pegasus-x-large |
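A brief sketch of how these parameters map onto `model.generate`, reusing the `tokenizer`, `model`, and `inputs` objects from the Quick Start example above. This is one reasonable way to apply the settings, not the only one:

```python
# Continuing from the Quick Start example: apply the documented
# inference parameters explicitly during generation.
summary_ids = model.generate(
    inputs["input_ids"],
    max_length=48,                   # maximum summary length in tokens
    min_length=2,                    # minimum summary length in tokens
    no_repeat_ngram_size=3,          # block repeated 3-grams within the output
    encoder_no_repeat_ngram_size=3,  # block 3-grams copied verbatim from the input
    early_stopping=True,
    length_penalty=0.1,
    num_beams=2,
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```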
Datasets and Metrics
The model was evaluated on several datasets; selected results are shown below:
samsum dataset
| Metric | Value |
|--------|-------|
| ROUGE-1 | 33.1401 |
| ROUGE-2 | 9.3095 |
| ROUGE-L | 24.8552 |
| ROUGE-LSUM | 29.0391 |
| loss | 2.288182497024536 |
| gen_len | 45.2173 |
launch/gov_report dataset

| Metric | Value |
|--------|-------|
| ROUGE-1 | 39.7279 |
| ROUGE-2 | 10.8944 |
| ROUGE-L | 19.7018 |
| ROUGE-LSUM | 36.5634 |
| loss | 2.473011016845703 |
| gen_len | 212.8243 |
billsum dataset

| Metric | Value |
|--------|-------|
| ROUGE-1 | 42.1065 |
| ROUGE-2 | 15.4079 |
| ROUGE-L | 24.8814 |
| ROUGE-LSUM | 36.0375 |
| loss | 1.9130958318710327 |
| gen_len | 179.2184 |
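For reference, ROUGE figures like those above can be computed with the Hugging Face `evaluate` library. The snippet below is a minimal sketch assuming the `evaluate` and `rouge_score` packages are installed; the predictions and references are placeholders, not the actual test data from the datasets listed above:

```python
import evaluate

rouge = evaluate.load("rouge")

# Placeholder outputs and references; the real evaluation uses the test
# splits of the datasets listed above (samsum, launch/gov_report, billsum).
predictions = ["large earthquakes on a fault segment recur at roughly constant intervals"]
references = ["large ruptures of the same fault segment occur at approximately constant time intervals"]

scores = rouge.compute(predictions=predictions, references=references)
# Recent versions of evaluate return fractions; scale by 100 to compare with the tables above
print({name: round(value * 100, 4) for name, value in scores.items()})
```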
📄 License
This model is distributed under the following licenses:
- Apache-2.0
- BSD-3-Clause