Legal Pegasus
模型简介
该模型是基于google/pegasus-cnn_dailymail微调的法律领域版本,专注于法律文档的抽象摘要生成任务,支持1024个标记的输入序列长度。
模型特点
法律领域优化
专门针对法律文档进行微调,能够更好地理解和摘要法律相关内容
长文本支持
支持最大1024个标记的输入序列长度,适合处理较长的法律文档
高质量摘要
相比基础模型,在法律文档摘要任务上ROUGE指标有显著提升
模型能力
法律文本理解
抽象摘要生成
长文本处理
使用案例
法律文档处理
诉讼公告摘要
自动生成诉讼公告的简洁摘要
ROUGE-1得分57.39,显著优于通用摘要模型
起诉书摘要
从复杂起诉书中提取关键信息生成摘要
ROUGE-L得分30.91,法律术语保留准确
🚀 用于法律文档摘要的PEGASUS
legal-pegasus 是 google/pegasus-cnn_dailymail 在 法律领域 的微调版本,经过训练可执行 抽象摘要 任务。输入序列的最大长度为 1024 个标记。
🚀 快速开始
安装依赖
本项目依赖 transformers
库,你可以使用以下命令进行安装:
pip install transformers
代码示例
以下是使用该模型进行法律文档摘要的代码示例:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("nsi319/legal-pegasus")
model = AutoModelForSeq2SeqLM.from_pretrained("nsi319/legal-pegasus")
text = """On March 5, 2021, the Securities and Exchange Commission charged AT&T, Inc. with repeatedly violating Regulation FD, and three of its Investor Relations executives with aiding and abetting AT&T's violations, by selectively disclosing material nonpublic information to research analysts. According to the SEC's complaint, AT&T learned in March 2016 that a steeper-than-expected decline in its first quarter smartphone sales would cause AT&T's revenue to fall short of analysts' estimates for the quarter. The complaint alleges that to avoid falling short of the consensus revenue estimate for the third consecutive quarter, AT&T Investor Relations executives Christopher Womack, Michael Black, and Kent Evans made private, one-on-one phone calls to analysts at approximately 20 separate firms. On these calls, the AT&T executives allegedly disclosed AT&T's internal smartphone sales data and the impact of that data on internal revenue metrics, despite the fact that internal documents specifically informed Investor Relations personnel that AT&T's revenue and sales of smartphones were types of information generally considered "material" to AT&T investors, and therefore prohibited from selective disclosure under Regulation FD. The complaint further alleges that as a result of what they were told on these calls, the analysts substantially reduced their revenue forecasts, leading to the overall consensus revenue estimate falling to just below the level that AT&T ultimately reported to the public on April 26, 2016. The SEC's complaint, filed in federal district court in Manhattan, charges AT&T with violations of the disclosure provisions of Section 13(a) of the Securities Exchange Act of 1934 and Regulation FD thereunder, and charges Womack, Evans and Black with aiding and abetting these violations. The complaint seeks permanent injunctive relief and civil monetary penalties against each defendant. The SEC's investigation was conducted by George N. Stepaniuk, Thomas Peirce, and David Zetlin-Jones of the SEC's New York Regional Office. The SEC's litigation will be conducted by Alexander M. Vasilescu, Victor Suthammanont, and Mr. Zetlin-Jones. The case is being supervised by Sanjay Wadhwa."""
input_tokenized = tokenizer.encode(text, return_tensors='pt',max_length=1024,truncation=True)
summary_ids = model.generate(input_tokenized,
num_beams=9,
no_repeat_ngram_size=3,
length_penalty=2.0,
min_length=150,
max_length=250,
early_stopping=True)
summary = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids][0]
### Summary Output
# The Securities and Exchange Commission today charged AT&T, Inc. and three of its Investor Relations executives with aiding and abetting the company's violations of the antifraud provisions of Section 10(b) of the Securities Exchange Act of 1934 and Rule 10b-5 thereunder. According to the SEC's complaint, the company learned in March 2016 that a steeper-than-expected decline in its first quarter smartphone sales would cause its revenue to fall short of analysts' estimates for the quarter. The complaint alleges that to avoid falling short of the consensus revenue estimate for the third consecutive quarter, the executives made private, one-on-one phone calls to analysts at approximately 20 separate firms. On these calls, the SEC alleges that Christopher Womack, Michael Black, and Kent Evans allegedly disclosed internal smartphone sales data and the impact of that data on internal revenue metrics. The SEC further alleges that as a result of what they were told, the analysts substantially reduced their revenue forecasts, leading to the overall consensus Revenue Estimate falling to just below the level that AT&t ultimately reported to the public on April 26, 2016. The SEC is seeking permanent injunctive relief and civil monetary penalties against each defendant.
📦 训练数据
该模型在 sec-litigation-releases 数据集上进行训练,该数据集包含超过 2700 份诉讼公告和投诉。
📊 评估结果
模型 | rouge1 | rouge1 精度 | rouge2 | rouge2 精度 | rougeL | rougeL 精度 |
---|---|---|---|---|---|---|
legal-pegasus | 57.39 | 62.97 | 26.85 | 28.42 | 30.91 | 33.22 |
pegasus-cnn_dailymail | 43.16 | 45.68 | 13.75 | 14.56 | 18.82 | 20.07 |
📄 许可证
本项目采用 MIT 许可证。
Bart Large Cnn
MIT
基于英语语料预训练的BART模型,专门针对CNN每日邮报数据集进行微调,适用于文本摘要任务
文本生成 英语
B
facebook
3.8M
1,364
Parrot Paraphraser On T5
Parrot是一个基于T5的释义框架,专为加速训练自然语言理解(NLU)模型而设计,通过生成高质量释义实现数据增强。
文本生成
Transformers

P
prithivida
910.07k
152
Distilbart Cnn 12 6
Apache-2.0
DistilBART是BART模型的蒸馏版本,专门针对文本摘要任务进行了优化,在保持较高性能的同时显著提升了推理速度。
文本生成 英语
D
sshleifer
783.96k
278
T5 Base Summarization Claim Extractor
基于T5架构的模型,专门用于从摘要文本中提取原子声明,是摘要事实性评估流程的关键组件。
文本生成
Transformers 英语

T
Babelscape
666.36k
9
Unieval Sum
UniEval是一个统一的多维评估器,用于自然语言生成任务的自动评估,支持多个可解释维度的评估。
文本生成
Transformers

U
MingZhong
318.08k
3
Pegasus Paraphrase
Apache-2.0
基于PEGASUS架构微调的文本复述模型,能够生成语义相同但表达不同的句子。
文本生成
Transformers 英语

P
tuner007
209.03k
185
T5 Base Korean Summarization
这是一个基于T5架构的韩语文本摘要模型,专为韩语文本摘要任务设计,通过微调paust/pko-t5-base模型在多个韩语数据集上训练而成。
文本生成
Transformers 韩语

T
eenzeenee
148.32k
25
Pegasus Xsum
PEGASUS是一种基于Transformer的预训练模型,专门用于抽象文本摘要任务。
文本生成 英语
P
google
144.72k
198
Bart Large Cnn Samsum
MIT
基于BART-large架构的对话摘要模型,专为SAMSum语料库微调,适用于生成对话摘要。
文本生成
Transformers 英语

B
philschmid
141.28k
258
Kobart Summarization
MIT
基于KoBART架构的韩语文本摘要模型,能够生成韩语新闻文章的简洁摘要。
文本生成
Transformers 韩语

K
gogamza
119.18k
12
精选推荐AI模型
Llama 3 Typhoon V1.5x 8b Instruct
专为泰语设计的80亿参数指令模型,性能媲美GPT-3.5-turbo,优化了应用场景、检索增强生成、受限生成和推理任务
大型语言模型
Transformers 支持多种语言

L
scb10x
3,269
16
Cadet Tiny
Openrail
Cadet-Tiny是一个基于SODA数据集训练的超小型对话模型,专为边缘设备推理设计,体积仅为Cosmo-3B模型的2%左右。
对话系统
Transformers 英语

C
ToddGoldfarb
2,691
6
Roberta Base Chinese Extractive Qa
基于RoBERTa架构的中文抽取式问答模型,适用于从给定文本中提取答案的任务。
问答系统 中文
R
uer
2,694
98