Pegasus-large open-source summary generation model - Free to use, outputs accurate summaries based on Google technology

Pegasus Large

Developed by Google

PEGASUS is an abstractive summarization model based on pre-training with gap sentences, developed by Google Research.

Text Generation English #Text Summarization #Pre-trained Models #Multi-dataset Mixing

Downloads 43.35k

Release Time: 3/2/2022

Model Overview

PEGASUS is a pre-trained model specifically designed for abstractive summarization, utilizing gap sentences for pre-training and suitable for various summarization tasks.

Model Features

Mixed and Random Training

Trained on a mixture of the C4 and HugeNews datasets for 1.5 million steps, with the mixing ratio weighted by each dataset's number of examples.
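
As a rough illustration of example-count weighting, the sketch below draws the source dataset for each training example with probability proportional to the dataset's size. The function names and the counts are hypothetical placeholders, not the real corpus sizes or the authors' data pipeline.

```python
import random


def mixture_weights(num_examples):
    """Return sampling probabilities proportional to each dataset's example count."""
    total = sum(num_examples.values())
    return {name: count / total for name, count in num_examples.items()}


def sample_dataset(weights, rng):
    """Pick the dataset that supplies the next training example."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# ASSUMPTION: placeholder counts for illustration only, not the real C4/HugeNews sizes.
placeholder_sizes = {"c4": 1_000, "hugenews": 4_000}
weights = mixture_weights(placeholder_sizes)

rng = random.Random(0)
draws = [sample_dataset(weights, rng) for _ in range(10_000)]
hugenews_share = draws.count("hugenews") / len(draws)
print(weights, hugenews_share)
```

With this scheme the larger corpus contributes proportionally more examples, so neither dataset is oversampled relative to its size.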

Dynamic Sentence Sampling

Samples the gap-sentence ratio uniformly between 15% and 45%, and adds 20% uniform noise to importance scores when selecting important sentences.

Improved Tokenizer

Updated SentencePiece tokenizer to support encoding line breaks, enhancing paragraph segmentation.

Model Capabilities

Text Summarization Generation

Multi-dataset Adaptation

Abstractive Summarization

Use Cases

News Summarization

CNN/DailyMail Summarization

Generates concise summaries for CNN/DailyMail news articles.

ROUGE-1/2/L: 44.16/21.56/41.30

XSum Summarization

Produces results for extreme summarization (single-sentence summary) tasks.

ROUGE-1/2/L: 47.60/24.83/39.64

Academic Paper Summarization

arXiv Summarization

Generates summaries for arXiv academic papers.

ROUGE-1/2/L: 44.21/16.95/25.67

PubMed Summarization

Generates summaries for PubMed medical papers.

ROUGE-1/2/L: 45.97/20.15/28.25

🚀 Pegasus Models

Pegasus models are designed for text summarization tasks, offering high-performance solutions based on advanced training techniques.

🚀 Quick Start

For detailed documentation, please refer to here. The original TF 1 code can be found here.
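
A minimal sketch of how the checkpoint might be used for summarization through the Hugging Face `transformers` library. It assumes the model is available on the Hub under the identifier `google/pegasus-large`, that `transformers`, `torch`, and `sentencepiece` are installed, and that a 1024-token input limit and a beam width of 4 are reasonable defaults. The `summarize` helper and its parameter names are illustrative, not part of any official API.

```python
def summarize(text, model_name="google/pegasus-large", max_input_tokens=1024, num_beams=4):
    """Generate an abstractive summary of `text` with a Pegasus checkpoint."""
    # Imported lazily so that defining the helper does not require the libraries.
    import torch
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    # ASSUMPTION: `model_name` is a Hub identifier or a local checkpoint directory.
    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name)
    model.eval()

    # Truncate long inputs so they fit the assumed input length limit.
    batch = tokenizer(
        [text],
        truncation=True,
        max_length=max_input_tokens,
        return_tensors="pt",
    )
    with torch.no_grad():
        summary_ids = model.generate(**batch, num_beams=num_beams)
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
```

Calling `summarize(article_text)` downloads the checkpoint on first use, which is large, so loading the tokenizer and model once and reusing them is preferable when summarizing many documents.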

✨ Features

Authors: Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
Maintained by: @sshleifer.
Task: Summarization

📚 Documentation

The following content is copied from the authors' README.

Mixed & Stochastic Checkpoints

We trained a Pegasus model with sampled gap-sentence ratios on both C4 and HugeNews, and stochastically sampled important sentences. The updated results (ROUGE-1/ROUGE-2/ROUGE-L) are reported in the following table:

Dataset	C4	HugeNews	Mixed & Stochastic
xsum	45.20/22.06/36.99	47.21/24.56/39.25	47.60/24.83/39.64
cnn_dailymail	43.90/21.20/40.76	44.17/21.47/41.11	44.16/21.56/41.30
newsroom	45.07/33.39/41.28	45.15/33.51/41.33	45.98/34.20/42.18
multi_news	46.74/17.95/24.26	47.52/18.72/24.91	47.65/18.75/24.95
gigaword	38.75/19.96/36.14	39.12/19.86/36.24	39.65/20.47/36.76
wikihow	43.07/19.70/34.79	41.35/18.51/33.42	46.39/22.12/38.41 *
reddit_tifu	26.54/8.94/21.64	26.63/9.01/21.60	27.99/9.81/22.94
big_patent	53.63/33.16/42.25	53.41/32.89/42.07	52.29/33.08/41.66 *
arxiv	44.70/17.27/25.80	44.67/17.18/25.73	44.21/16.95/25.67
pubmed	45.49/19.90/27.69	45.09/19.56/27.42	45.97/20.15/28.25
aeslc	37.69/21.85/36.84	37.40/21.22/36.45	37.68/21.25/36.51
billsum	57.20/39.56/45.80	57.31/40.19/45.82	59.67/41.58/47.59
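
To make the three numbers in each cell concrete, here is a simplified sketch of ROUGE-1, ROUGE-2, and ROUGE-L F1 on whitespace tokens, scaled to 0-100 as in the table. It is an illustration only: the function names are hypothetical, and the authors' evaluation pipeline may differ in tokenization, stemming, and the exact ROUGE-L variant, so this will not reproduce the published scores.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, pred_total, ref_total):
    """F1 from an overlap count and the prediction/reference totals."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n):
    """N-gram overlap F1 between two whitespace-tokenized strings."""
    pred = _ngrams(prediction.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    return _f1(overlap, sum(pred.values()), sum(ref.values()))


def _lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """Longest-common-subsequence F1 between two strings."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return _f1(_lcs_length(pred, ref), len(pred), len(ref))


def rouge_triplet(prediction, reference):
    """Format scores as ROUGE-1/ROUGE-2/ROUGE-L on a 0-100 scale."""
    scores = (
        rouge_n(prediction, reference, 1),
        rouge_n(prediction, reference, 2),
        rouge_l(prediction, reference),
    )
    return "/".join(f"{100 * s:.2f}" for s in scores)


print(rouge_triplet("the cat sat on the mat", "the cat lay on the mat"))
# prints 83.33/60.00/83.33
```

Reading a table cell such as 47.60/24.83/39.64 the same way: unigram overlap F1, bigram overlap F1, and longest-common-subsequence F1, each averaged over the test set.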

(*) The numbers for the Wikihow and BigPatent datasets are not comparable because of changes in tokenization and data:

The Wikihow dataset contains newline characters, which are useful for paragraph segmentation. The SentencePiece tokenizer of the C4 and HugeNews models doesn't encode newlines and loses this information.
We updated the BigPatent dataset to preserve casing, and some of the format cleaning was also changed. Please refer to the changes in TFDS.

The "Mixed & Stochastic" model has the following changes (from pegasus-large in the paper):

Trained on both C4 and HugeNews (the dataset mixture is weighted by their number of examples).
Trained for 1.5M steps instead of 500k (we observed slower convergence on pretraining perplexity).
The model uniformly samples a gap-sentence ratio between 15% and 45%.
Important sentences are sampled after adding 20% uniform noise to their importance scores.
The SentencePiece tokenizer is updated to be able to encode newline characters.
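
The stochastic gap-sentence selection described in the list above can be sketched roughly as follows. This is not the authors' code. It assumes a unigram-overlap F1 against the rest of the document as a stand-in importance score, models "20% uniform noise" as a multiplicative factor drawn from [0.8, 1.2], and uses `<mask_1>` as a placeholder sentence-mask token. All function and parameter names are hypothetical.

```python
import random
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between two token lists (stand-in importance score)."""
    if not candidate or not reference:
        return 0.0
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def select_gap_sentences(sentences, rng, min_ratio=0.15, max_ratio=0.45, noise=0.20):
    """Return the sorted indices of the sentences to mask as gap sentences."""
    tokenized = [s.lower().split() for s in sentences]
    # Sample the gap-sentence ratio uniformly from [min_ratio, max_ratio].
    ratio = rng.uniform(min_ratio, max_ratio)
    num_gaps = max(1, round(ratio * len(sentences)))
    noisy_scores = []
    for i, sent in enumerate(tokenized):
        # Score each sentence against all other sentences in the document.
        rest = [tok for j, other in enumerate(tokenized) if j != i for tok in other]
        score = unigram_f1(sent, rest)
        # ASSUMPTION: the noise is multiplicative and uniform in [1 - noise, 1 + noise].
        noisy_scores.append(score * rng.uniform(1 - noise, 1 + noise))
    ranked = sorted(range(len(sentences)), key=lambda i: noisy_scores[i], reverse=True)
    return sorted(ranked[:num_gaps])


def mask_document(sentences, gap_indices, mask_token="<mask_1>"):
    """Build an (input, target) pair: masked document and concatenated gap sentences."""
    gaps = set(gap_indices)
    source = [mask_token if i in gaps else s for i, s in enumerate(sentences)]
    target = [sentences[i] for i in gap_indices]
    return " ".join(source), " ".join(target)


doc = [
    "Pegasus is pre-trained by removing whole sentences from a document.",
    "The model must then generate the removed sentences.",
    "This objective resembles abstractive summarization.",
    "Sentences that overlap most with the rest of the document are preferred.",
]
gap_indices = select_gap_sentences(doc, random.Random(0))
masked_input, target = mask_document(doc, gap_indices)
print(masked_input)
print(target)
```

Because both the ratio and the scores are randomized, repeated passes over the same document can mask different sentences, which is the point of the stochastic variant.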

Citation

@misc{zhang2019pegasus,
    title={PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization},
    author={Jingqing Zhang and Yao Zhao and Mohammad Saleh and Peter J. Liu},
    year={2019},
    eprint={1912.08777},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
