Pegasus-Xsum Open-source Text Summarization Model - Free and Efficient Abstractive Text Summarization

Pegasus Xsum

Developed by Google

PEGASUS is a Transformer-based pretrained model specifically designed for abstractive text summarization tasks.

Text Generation English #Text Summarization Pretraining #Hybrid Dataset Optimization #Dynamic Sentence Sampling

Downloads 144.72k

Release Time: 3/2/2022

Model Overview

PEGASUS is a Transformer architecture-based pretrained model, specifically designed for abstractive text summarization tasks. It learns to generate high-quality summaries by pretraining on large-scale text data.

Model Features

Hybrid and Random Training

Trained on both C4 and HugeNews, with the dataset mixture weighted by the number of examples in each dataset and with important sentences sampled stochastically.
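
The example-count weighting described above can be sketched roughly as follows. This is a minimal illustration, not the original training code: the function names `mixture_weights` and `sample_source` are hypothetical, and the example counts are made-up placeholders rather than the real dataset sizes.

```python
import random


def mixture_weights(example_counts):
    """Return per-dataset sampling probabilities proportional to example counts."""
    total = sum(example_counts.values())
    return {name: count / total for name, count in example_counts.items()}


def sample_source(weights, rng):
    """Pick the dataset that the next pretraining example is drawn from."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical example counts, for illustration only (not the real sizes).
counts = {"C4": 300, "HugeNews": 1200}
weights = mixture_weights(counts)

rng = random.Random(0)
draws = [sample_source(weights, rng) for _ in range(10_000)]
print(weights)
print(draws.count("HugeNews") / len(draws))
```

With these placeholder counts, roughly four out of five sampled examples come from the larger dataset, which is the sense in which the mixture is "weighted by number of examples".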

Dynamic Sentence Gap Ratio

Uniformly samples a gap sentence ratio between 15% and 45% during training to enhance model adaptability.

Importance Score Noise

Adds 20% uniform noise to importance scores during sentence sampling to improve model robustness.

Improved Tokenizer

Updated SentencePiece tokenizer to support encoding newline characters, preserving paragraph segmentation information.

Model Capabilities

Text Summarization Generation

Multi-document Summarization

Abstractive Summarization

Use Cases

News Summarization

CNN/DailyMail News Summarization

Generates concise summaries for CNN/DailyMail news articles

ROUGE-1/2/L: 44.16/21.56/41.30

Academic Paper Summarization

arXiv Paper Summarization

Generates summaries for arXiv academic papers

ROUGE-1/2/L: 44.21/16.95/25.67

Legal Document Summarization

BigPatent Patent Summarization

Generates summaries for patent documents

ROUGE-1/2/L: 52.29/33.08/41.66

🚀 Google Pegasus-XSum Model

A text summarization model with high-performance results on multiple datasets.

📚 Documentation

See Docs: here
Original TF 1 code here

👥 Authors and Maintainer

Authors: Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019
Maintained by: @sshleifer

🎯 Task

The model is designed for the task of Summarization.
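
As a starting point, the checkpoint can likely be loaded through the Hugging Face `transformers` library. The sketch below assumes `transformers`, `torch`, and `sentencepiece` are installed; the helper name `summarize` and the generation settings (`max_length=64`, `num_beams=4`) are illustrative choices, not values taken from this page.

```python
def summarize(text, model_name="google/pegasus-xsum", max_length=64, num_beams=4):
    """Generate a one-sentence-style abstractive summary for `text`.

    The import is done lazily so that merely defining this helper does not
    require the (large) dependencies or a model download.
    """
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name)

    # Truncate long inputs to the model's maximum input length.
    batch = tokenizer([text], truncation=True, padding="longest", return_tensors="pt")

    # Beam search settings here are assumptions chosen for illustration.
    summary_ids = model.generate(**batch, max_length=max_length, num_beams=num_beams)
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
```

Calling `summarize(article_text)` downloads the weights on first use and returns a single summary string. For repeated calls it would be better to load the tokenizer and model once and reuse them.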

📊 Model Results

Property	Details
Model Name	google/pegasus-xsum
Task Type	Summarization
Results on samsum (train split)	ROUGE-1: 21.8096, ROUGE-2: 4.2525, ROUGE-L: 17.4469, ROUGE-LSUM: 18.8907, loss: 3.0317161083221436, gen_len: 20.3122
Results on xsum (test split)	ROUGE-1: 46.8623, ROUGE-2: 24.4533, ROUGE-L: 39.0548, ROUGE-LSUM: 39.0994, loss: 1.5717021226882935, gen_len: 22.8821
Results on cnn_dailymail (test split)	ROUGE-1: 22.2062, ROUGE-2: 7.6701, ROUGE-L: 15.4046, ROUGE-LSUM: 19.2182, loss: 2.681241273880005, gen_len: 25.0234
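
To give a feel for what the ROUGE-1 and ROUGE-2 figures measure, here is a simplified n-gram overlap F1 in plain Python. It is only a sketch: it uses whitespace tokenization and lowercasing, and skips the stemming and other preprocessing that standard ROUGE scorers apply, so its numbers will not match the reported scores exactly. ROUGE-L and ROUGE-LSUM (based on longest common subsequence) are not covered.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference, candidate, n=1):
    """Simplified ROUGE-N F1: clipped n-gram overlap between two strings."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # multiset intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
# Scores are scaled to 0-100, as in the results table.
print(round(100 * rouge_n_f1(reference, candidate, n=1), 2))
print(round(100 * rouge_n_f1(reference, candidate, n=2), 2))
```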

📈 Mixed & Stochastic Checkpoints

We train a PEGASUS model with sampled gap sentence ratios on both C4 and HugeNews, and stochastically sample important sentences. The updated results are reported in this table as ROUGE-1/ROUGE-2/ROUGE-L.

dataset	C4	HugeNews	Mixed & Stochastic
xsum	45.20/22.06/36.99	47.21/24.56/39.25	47.60/24.83/39.64
cnn_dailymail	43.90/21.20/40.76	44.17/21.47/41.11	44.16/21.56/41.30
newsroom	45.07/33.39/41.28	45.15/33.51/41.33	45.98/34.20/42.18
multi_news	46.74/17.95/24.26	47.52/18.72/24.91	47.65/18.75/24.95
gigaword	38.75/19.96/36.14	39.12/19.86/36.24	39.65/20.47/36.76
wikihow	43.07/19.70/34.79	41.35/18.51/33.42	46.39/22.12/38.41 *
reddit_tifu	26.54/8.94/21.64	26.63/9.01/21.60	27.99/9.81/22.94
big_patent	53.63/33.16/42.25	53.41/32.89/42.07	52.29/33.08/41.66 *
arxiv	44.70/17.27/25.80	44.67/17.18/25.73	44.21/16.95/25.67
pubmed	45.49/19.90/27.69	45.09/19.56/27.42	45.97/20.15/28.25
aeslc	37.69/21.85/36.84	37.40/21.22/36.45	37.68/21.25/36.51
billsum	57.20/39.56/45.80	57.31/40.19/45.82	59.67/41.58/47.59

Changes in the "Mixed & Stochastic" Model

Trained on both C4 and HugeNews (the dataset mixture is weighted by their number of examples).
Trained for 1.5M steps instead of 500k (we observe slower convergence on pretraining perplexity).
The model uniformly samples a gap sentence ratio between 15% and 45%.
Important sentences are sampled by adding 20% uniform noise to the importance scores.
The sentencepiece tokenizer is updated to be able to encode the newline character.
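
The gap sentence sampling described in the list above might look roughly like the sketch below. Several details are assumptions made for illustration: importance is approximated by a unigram-overlap F1 between a sentence and the rest of the document, the 20% noise is applied multiplicatively as a uniform factor in [0.8, 1.2], and the helper names `unigram_f1` and `select_gap_sentences` are hypothetical. The original implementation may differ on each of these points.

```python
import random
from collections import Counter


def unigram_f1(a_tokens, b_tokens):
    """Unigram-overlap F1 between two token lists (a simplified ROUGE-1)."""
    overlap = sum((Counter(a_tokens) & Counter(b_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(a_tokens)
    recall = overlap / len(b_tokens)
    return 2 * precision * recall / (precision + recall)


def select_gap_sentences(sentences, rng, ratio_range=(0.15, 0.45), noise=0.20):
    """Return sorted indices of the sentences to mask as gap sentences."""
    tokenized = [s.lower().split() for s in sentences]
    scores = []
    for i, tokens in enumerate(tokenized):
        # Importance: overlap between this sentence and the rest of the document.
        rest = [t for j, other in enumerate(tokenized) if j != i for t in other]
        score = unigram_f1(tokens, rest)
        # Assumption: noise is a multiplicative factor, uniform in [1-noise, 1+noise].
        scores.append(score * rng.uniform(1 - noise, 1 + noise))

    # Sample the gap sentence ratio uniformly, then mask at least one sentence.
    ratio = rng.uniform(*ratio_range)
    k = max(1, round(ratio * len(sentences)))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


doc = [
    "The council approved the new transit plan on Monday.",
    "The transit plan adds three bus routes across the city.",
    "Critics say the plan costs too much.",
    "Supporters argue the new bus routes will cut traffic.",
    "Weather was mild during the vote.",
    "The council will review the plan costs next year.",
    "Local shops expect more visitors from the new routes.",
    "A second vote on the transit budget is planned.",
    "The mayor praised the council after the vote.",
    "Construction on the bus routes starts in spring.",
]
print(select_gap_sentences(doc, random.Random(0)))
```

Because both the ratio and the scores are randomized, the same document can yield different masked sentences across training passes.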

⚠️ Important Note

(*) The numbers for the wikihow and big_patent datasets are not comparable because of changes in tokenization and data:

The wikihow dataset contains newline characters, which are useful for paragraph segmentation. The sentencepiece tokenizer of the C4 and HugeNews models doesn't encode newlines and loses this information.
We updated the BigPatent dataset to preserve casing, and some format cleaning was also changed; please refer to the change in TFDS.
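
The effect of dropping newlines can be illustrated with a toy example. This is not SentencePiece and does not reproduce the actual tokenizer; it only shows how collapsing whitespace erases paragraph boundaries, while mapping newlines to a placeholder keeps them. The placeholder string `<n>` and both function names are assumptions made for this sketch.

```python
def normalize_whitespace(text):
    """Collapse all whitespace runs, newlines included, into single spaces."""
    return " ".join(text.split())


def keep_newlines(text, newline_symbol="<n>"):
    """Replace each newline with a visible placeholder, then collapse whitespace."""
    return " ".join(text.replace("\n", f" {newline_symbol} ").split())


steps = "Preheat the oven.\nMix the batter."
print(normalize_whitespace(steps))  # paragraph boundary is lost
print(keep_newlines(steps))         # boundary survives as a placeholder
```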

📖 Citation

@misc{zhang2019pegasus,
    title={PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization},
    author={Jingqing Zhang and Yao Zhao and Mohammad Saleh and Peter J. Liu},
    year={2019},
    eprint={1912.08777},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
