T5 Large Finetuned Xsum
Model Overview
This model was created by fine-tuning T5-large on the XSUM dataset. It compresses long texts into concise summaries while preserving the key information of the original.
Model Features
- High-quality summary generation: produces concise summaries that preserve the key information of the source text.
- Fine-tuned on the XSUM dataset: fine-tuning on a dedicated summarization dataset sharpens its summarization ability.
- Flexible generation control: supports adjusting a range of generation parameters, such as length penalty and temperature sampling (see the sketch after this list).
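A minimal sketch of that generation control in practice, contrasting deterministic beam search with temperature sampling (the parameter values below are illustrative assumptions, not tuned settings):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("sysresearch101/t5-large-finetuned-xsum")
model = AutoModelForSeq2SeqLM.from_pretrained("sysresearch101/t5-large-finetuned-xsum")

inputs = tokenizer("Long article text ...", return_tensors="pt", truncation=True)

# Deterministic beam search: repeatable output, favours high-probability phrasing.
beam_ids = model.generate(**inputs, num_beams=5, length_penalty=2.0, max_length=60)

# Temperature sampling: more varied phrasing at the cost of run-to-run stability.
sample_ids = model.generate(**inputs, do_sample=True, temperature=0.8,
                            top_p=0.95, max_length=60)

print(tokenizer.decode(beam_ids[0], skip_special_tokens=True))
print(tokenizer.decode(sample_ids[0], skip_special_tokens=True))
```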
Model Capabilities
- Text summarization
- Long-text compression
- Key information extraction
Use Cases
News summarization
- News article summarization: compress long news articles into short summaries, generating 1-2 sentence digests that contain the key information.
Document processing
- Report summarization: automatically generate summaries of long documents or reports, extracting the core content into a brief overview.
🚀 T5-large Summarization Model Fine-tuned on XSUM
This is a fine-tuned T5-large summarization model trained on the XSUM dataset. It produces effective text summaries, making it easier to digest information quickly.
🚀 Quick Start
Loading the fine-tuned model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("sysresearch101/t5-large-finetuned-xsum")
model = AutoModelForSeq2SeqLM.from_pretrained("sysresearch101/t5-large-finetuned-xsum")

ARTICLE_TO_SUMMARIZE = "..."

# Generate the summary
input_ids = tokenizer.encode(ARTICLE_TO_SUMMARIZE, return_tensors='pt')
summary_ids = model.generate(input_ids,
                             min_length=20,
                             max_length=80,
                             num_beams=10,
                             repetition_penalty=2.5,
                             length_penalty=1.0,
                             early_stopping=True,
                             no_repeat_ngram_size=2,
                             use_cache=True,
                             do_sample=True,
                             temperature=0.8,
                             top_k=50,
                             top_p=0.95)
summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary_text)
Output: <TODO>
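The snippet above encodes the article without truncation, and T5's encoder context is limited (512 tokens for the standard checkpoints). A minimal sketch for longer inputs, assuming the usual 512-token limit applies to this checkpoint; the `summarize: ` task prefix is the convention for the original T5 models, and whether this fine-tuned checkpoint expects it is an assumption here:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("sysresearch101/t5-large-finetuned-xsum")
model = AutoModelForSeq2SeqLM.from_pretrained("sysresearch101/t5-large-finetuned-xsum")

article = "..."  # long input text

# Truncate to the encoder's context window; "summarize: " is the standard
# T5 task prefix (assumption: this checkpoint follows that convention).
inputs = tokenizer("summarize: " + article, return_tensors="pt",
                   max_length=512, truncation=True)

summary_ids = model.generate(**inputs, num_beams=5, min_length=20, max_length=80)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```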
Using the model via a pipeline
The following shows how to use this model through the pipeline API:
from transformers import pipeline
summarizer = pipeline("summarization", model="sysresearch101/t5-large-finetuned-xsum")
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
>>> [{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]
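The pipeline also accepts a list of documents, which makes bulk summarization straightforward; a minimal sketch (the batch_size and truncation settings are illustrative assumptions):

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="sysresearch101/t5-large-finetuned-xsum")

docs = ["First long article ...", "Second long article ..."]

# Batched inference; truncation guards against inputs that exceed the
# encoder's context window (parameter values here are illustrative).
results = summarizer(docs, max_length=80, min_length=20,
                     do_sample=False, truncation=True, batch_size=2)
for r in results:
    print(r["summary_text"])
```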
✨ Key Features
- Fine-tuning: t5-large-finetuned-xsum is based on Hugging Face's t5-large model and fine-tuned on the XSUM dataset, making it better suited to text summarization tasks.
- Multiple ways to call: the model can be loaded directly for summary generation, or used quickly and conveniently through the pipeline API.
📚 Documentation
Fine-tuning corpus
The t5-large-finetuned-xsum model is based on Hugging Face's t5-large model and was fine-tuned on the XSUM dataset.
Evaluation Results
Task | Dataset | Metric | Value | Verified
---|---|---|---|---
Summarization | xsum (test set) | ROUGE-1 | 26.8921 | verified
Summarization | xsum (test set) | ROUGE-2 | 6.9411 | verified
Summarization | xsum (test set) | ROUGE-L | 21.2832 | verified
Summarization | xsum (test set) | ROUGE-LSUM | 21.284 | verified
Summarization | xsum (test set) | loss | 2.5411810874938965 | verified
Summarization | xsum (test set) | gen_len | 18.7755 | verified
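A minimal sketch of how ROUGE figures like these can be reproduced with the `datasets` and `evaluate` libraries (assumptions: the sample size is illustrative, and loading XSUM may require extra arguments such as trust_remote_code depending on your `datasets` version):

```python
import evaluate
from datasets import load_dataset
from transformers import pipeline

# Small slice of the XSUM test split, for a quick sanity check only.
ds = load_dataset("xsum", split="test[:16]")

summarizer = pipeline("summarization", model="sysresearch101/t5-large-finetuned-xsum")
preds = [out["summary_text"]
         for out in summarizer(ds["document"], max_length=80, min_length=20,
                               do_sample=False, truncation=True)]

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=preds, references=ds["summary"]))
```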
📄 License
This project is released under the MIT license.
Related Models
Model | Author | License | Tags | Downloads | Likes | Description
---|---|---|---|---|---|---
Bart Large Cnn | facebook | MIT | Text Generation, English | 3.8M | 1,364 | English-pretrained BART model fine-tuned on the CNN/Daily Mail dataset, suited to text summarization.
Parrot Paraphraser On T5 | prithivida | - | Text Generation, Transformers | 910.07k | 152 | Parrot is a T5-based paraphrasing framework designed to speed up training of natural language understanding (NLU) models through high-quality paraphrase generation for data augmentation.
Distilbart Cnn 12 6 | sshleifer | Apache-2.0 | Text Generation, English | 783.96k | 278 | DistilBART is a distilled version of BART optimized for summarization, delivering markedly faster inference while retaining most of the quality.
T5 Base Summarization Claim Extractor | Babelscape | - | Text Generation, Transformers, English | 666.36k | 9 | T5-based model for extracting atomic claims from summaries; a key component in summary factuality evaluation pipelines.
Unieval Sum | MingZhong | - | Text Generation, Transformers | 318.08k | 3 | UniEval is a unified multi-dimensional evaluator for the automatic evaluation of natural language generation, supporting several interpretable dimensions.
Pegasus Paraphrase | tuner007 | Apache-2.0 | Text Generation, Transformers, English | 209.03k | 185 | PEGASUS-based paraphrasing model fine-tuned to generate sentences with the same meaning but different wording.
T5 Base Korean Summarization | eenzeenee | - | Text Generation, Transformers, Korean | 148.32k | 25 | T5-based Korean summarization model fine-tuned from paust/pko-t5-base on several Korean datasets.
Pegasus Xsum | google | - | Text Generation, English | 144.72k | 198 | PEGASUS is a Transformer-based pretrained model designed for abstractive text summarization.
Bart Large Cnn Samsum | philschmid | MIT | Text Generation, Transformers, English | 141.28k | 258 | BART-large dialogue summarization model fine-tuned on the SAMSum corpus.
Kobart Summarization | gogamza | MIT | Text Generation, Transformers, Korean | 119.18k | 12 | KoBART-based Korean summarization model that generates concise summaries of Korean news articles.