# T5 Large Finetuned Xsum

## Model Overview
This model was created by fine-tuning T5-large on the XSUM dataset. It condenses long texts into concise summaries while preserving the key information of the original.

## Model Features
- **High-quality summary generation**: produces concise summaries that retain the key information of the source text.
- **Fine-tuned on the XSUM dataset**: fine-tuning on a dedicated summarization dataset improves summary quality.
- **Flexible generation control**: supports tuning of generation parameters such as length penalty and temperature sampling.

## Model Capabilities
- Text summarization
- Long-text compression
- Key information extraction

## Use Cases
- **News summarization**: compress long news articles into short, 1-2 sentence summaries that contain the key information.
- **Document processing**: automatically summarize long documents and reports, distilling the core content into a brief overview.
# 🚀 T5-large Text Summarization Model Based on the XSUM Dataset

This model is a fine-tuned T5-large text summarization model trained on the XSUM dataset. It produces effective summaries of input text, making it easier to digest information quickly.

## 🚀 Quick Start

### Loading the fine-tuned model
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("sysresearch101/t5-large-finetuned-xsum")
model = AutoModelForSeq2SeqLM.from_pretrained("sysresearch101/t5-large-finetuned-xsum")

ARTICLE_TO_SUMMARIZE = "..."

# Generate the summary. With num_beams > 1 and do_sample=True, generation
# uses beam-search multinomial sampling.
input_ids = tokenizer.encode(ARTICLE_TO_SUMMARIZE, return_tensors="pt")
summary_ids = model.generate(
    input_ids,
    min_length=20,
    max_length=80,
    num_beams=10,
    repetition_penalty=2.5,
    length_penalty=1.0,
    early_stopping=True,
    no_repeat_ngram_size=2,
    use_cache=True,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
)
summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary_text)
```
Output: <TODO>
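XSUM news articles can be longer than the encoder's default maximum input length. Below is a minimal sketch, assuming the standard `transformers` tokenizer call, of encoding with explicit truncation and running generation on a GPU when one is available:

```python
import torch

# Move the model to GPU if available and truncate long inputs to the
# tokenizer's maximum length (512 tokens by default for T5 tokenizers).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = tokenizer(
    ARTICLE_TO_SUMMARIZE,
    return_tensors="pt",
    truncation=True,
    max_length=512,
).to(device)

summary_ids = model.generate(**inputs, min_length=20, max_length=80, num_beams=10)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```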
### Using the model via a pipeline

The following shows how to use this model through the pipeline API:
```python
from transformers import pipeline

summarizer = pipeline("summarization", model="sysresearch101/t5-large-finetuned-xsum")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
>>> [{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]
```
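The pipeline also accepts a list of documents, which makes it easy to summarize several articles in one call. A minimal sketch follows; the second article string is a hypothetical placeholder:

```python
# The pipeline returns one result dict per input document.
articles = [ARTICLE, "Another long news article ..."]  # second input is a hypothetical placeholder
for result in summarizer(articles, max_length=130, min_length=30, do_sample=False):
    print(result["summary_text"])
```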
## ✨ Main Features

- **Fine-tuning optimization**: t5-large-finetuned-xsum is based on Hugging Face's t5-large model and fine-tuned on the XSUM dataset, making it better suited to text summarization.
- **Multiple ways to call**: the model can either be loaded directly for summary generation or used quickly and conveniently through the pipeline API.
## 📚 Documentation

### Fine-tuning corpus

t5-large-finetuned-xsum is based on Hugging Face's t5-large model and was fine-tuned on the XSUM dataset.
### Model evaluation results

| Task | Dataset | Metric | Value | Verification |
|---|---|---|---|---|
| Text summarization | xsum (test set) | ROUGE-1 | 26.8921 | verified |
| Text summarization | xsum (test set) | ROUGE-2 | 6.9411 | verified |
| Text summarization | xsum (test set) | ROUGE-L | 21.2832 | verified |
| Text summarization | xsum (test set) | ROUGE-LSUM | 21.284 | verified |
| Text summarization | xsum (test set) | loss | 2.5411810874938965 | verified |
| Text summarization | xsum (test set) | gen_len | 18.7755 | verified |
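For reference, here is a minimal sketch of how ROUGE scores like those above are typically reproduced on the XSUM test split, assuming the `datasets` and `evaluate` libraries are installed:

```python
import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("sysresearch101/t5-large-finetuned-xsum")
model = AutoModelForSeq2SeqLM.from_pretrained("sysresearch101/t5-large-finetuned-xsum")

# Score model summaries against reference summaries on a small test slice.
dataset = load_dataset("xsum", split="test[:10]")
rouge = evaluate.load("rouge")

predictions = []
for document in dataset["document"]:
    inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=512)
    ids = model.generate(**inputs, min_length=20, max_length=80, num_beams=10)
    predictions.append(tokenizer.decode(ids[0], skip_special_tokens=True))

# evaluate returns fractional scores; multiply by 100 to compare with the table.
print(rouge.compute(predictions=predictions, references=dataset["summary"]))
```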
## 📄 License

This project is licensed under the MIT License.