🚀 pszemraj/pegasus-x-large-book-summary
Built on the Transformer architecture and designed for long-document summarization, this model handles long input sequences effectively, achieves strong ROUGE scores on multiple datasets, and can generate summaries for a variety of texts, including earthquake-related reports, academic papers, and lecture transcripts.
🚀 Quick Start
The model can be used to summarize many kinds of text, such as earthquake-related passages, academic papers, and lecture transcripts. A usage example follows:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the fine-tuned summarization model
tokenizer = AutoTokenizer.from_pretrained("pszemraj/pegasus-x-large-book-summary")
model = AutoModelForSeq2SeqLM.from_pretrained("pszemraj/pegasus-x-large-book-summary")

# Example input: a passage on earthquake recurrence intervals
text = "large earthquakes along a given fault segment do not occur at random intervals because it takes time to accumulate the strain energy for the rupture. The rates at which tectonic plates move and accumulate strain at their boundaries are approximately uniform. Therefore, in first approximation, one may expect that large ruptures of the same fault segment will occur at approximately constant time intervals. If subsequent main shocks have different amounts of slip across the fault, then the recurrence time may vary, and the basic idea of periodic mainshocks must be modified. For great plate boundary ruptures the length and slip often vary by a factor of 2. Along the southern segment of the San Andreas fault the recurrence interval is 145 years with variations of several decades. The smaller the standard deviation of the average recurrence interval, the more specific could be the long term prediction of a future mainshock."

# Tokenize the input, generate a summary with beam search, and decode it
inputs = tokenizer(text, return_tensors="pt")
summary_ids = model.generate(inputs["input_ids"], num_beams=2, early_stopping=True, length_penalty=0.1)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```
✨ Key Features
- Long-sequence handling: processes long input sequences (e.g. up to 4096 tokens) at a lower computational cost on long inputs than conventional Transformer-based models; see the usage sketch after this list.
- Efficient attention: replaces full attention with block sparse attention, significantly reducing computational complexity.
- Multi-task applicability: suitable for a range of natural language processing tasks, especially long-document summarization.
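As a minimal sketch of how the long-input capability can be used in practice, the model can be wrapped in the standard `transformers` summarization pipeline. The file name `chapter.txt` is a placeholder, and the generation settings below simply mirror the parameter table in the next section; they are illustrative assumptions, not the only valid configuration:

```python
from transformers import pipeline

# Wrap the model in the standard summarization pipeline
summarizer = pipeline(
    "summarization",
    model="pszemraj/pegasus-x-large-book-summary",
)

# Hypothetical long document; any long string can be passed in
with open("chapter.txt", encoding="utf-8") as f:
    long_text = f.read()

result = summarizer(
    long_text,
    max_length=48,
    min_length=2,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    num_beams=2,
    length_penalty=0.1,
    early_stopping=True,
    truncation=True,  # inputs longer than the model's limit are truncated by the tokenizer
)
print(result[0]["summary_text"])
```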
📚 Documentation
Model Parameters
| Parameter | Value |
|-----------|-------|
| Max length | 48 |
| Min length | 2 |
| No-repeat n-gram size | 3 |
| Encoder no-repeat n-gram size | 3 |
| Early stopping | Enabled |
| Length penalty | 0.1 |
| Number of beams | 2 |
| Base model | google/pegasus-x-large |
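A brief sketch of how these parameters map onto `model.generate`, reusing the `tokenizer`, `model`, and `inputs` objects from the Quick Start example above. This is one reasonable way to apply the settings, not the only one:

```python
# Continuing from the Quick Start example: apply the documented
# inference parameters explicitly during generation.
summary_ids = model.generate(
    inputs["input_ids"],
    max_length=48,                   # maximum summary length in tokens
    min_length=2,                    # minimum summary length in tokens
    no_repeat_ngram_size=3,          # block repeated 3-grams within the output
    encoder_no_repeat_ngram_size=3,  # block 3-grams copied verbatim from the input
    early_stopping=True,
    length_penalty=0.1,
    num_beams=2,
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```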
Datasets and Metrics
The model was evaluated on several datasets; selected results are shown below:
samsum dataset
| Metric | Value |
|--------|-------|
| ROUGE-1 | 33.1401 |
| ROUGE-2 | 9.3095 |
| ROUGE-L | 24.8552 |
| ROUGE-LSUM | 29.0391 |
| loss | 2.288182497024536 |
| gen_len | 45.2173 |
launch/gov_report dataset

| Metric | Value |
|--------|-------|
| ROUGE-1 | 39.7279 |
| ROUGE-2 | 10.8944 |
| ROUGE-L | 19.7018 |
| ROUGE-LSUM | 36.5634 |
| loss | 2.473011016845703 |
| gen_len | 212.8243 |
billsum dataset

| Metric | Value |
|--------|-------|
| ROUGE-1 | 42.1065 |
| ROUGE-2 | 15.4079 |
| ROUGE-L | 24.8814 |
| ROUGE-LSUM | 36.0375 |
| loss | 1.9130958318710327 |
| gen_len | 179.2184 |
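For reference, ROUGE figures like those above can be computed with the Hugging Face `evaluate` library. The snippet below is a minimal sketch assuming the `evaluate` and `rouge_score` packages are installed; the predictions and references are placeholders, not the actual test data from the datasets listed above:

```python
import evaluate

rouge = evaluate.load("rouge")

# Placeholder outputs and references; the real evaluation uses the test
# splits of the datasets listed above (samsum, launch/gov_report, billsum).
predictions = ["large earthquakes on a fault segment recur at roughly constant intervals"]
references = ["large ruptures of the same fault segment occur at approximately constant time intervals"]

scores = rouge.compute(predictions=predictions, references=references)
# Recent versions of evaluate return fractions; scale by 100 to compare with the tables above
print({name: round(value * 100, 4) for name, value in scores.items()})
```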
📄 License
This model is distributed under the following licenses:
- Apache-2.0
- BSD-3-Clause