Pegasus Models
Pegasus is a model for abstractive summarization built around a novel pre-training objective: important "gap sentences" are removed from a document and the model learns to generate them from the remaining text. This README covers the model, its training, and experimental results.
Quick Start
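The checkpoints are available through the Hugging Face transformers library. Below is a minimal usage sketch; google/pegasus-xsum is one published fine-tuned checkpoint, and any other PEGASUS checkpoint can be substituted:

```python
# Summarize a piece of text with a fine-tuned PEGASUS checkpoint.
# Requires: pip install transformers torch sentencepiece
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

text = (
    "PG&E stated it scheduled the blackouts in response to forecasts for high "
    "winds amid dry conditions. The aim is to reduce the risk of wildfires."
)

# Tokenize, generate a summary, and decode it back to text.
batch = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")
summary_ids = model.generate(**batch)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```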
Features
- Authors: Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu (Dec 18, 2019)
- Maintained by: @sshleifer
- Task: Summarization
Documentation
The following is copied from the authors' README.
Mixed & Stochastic Checkpoints
We train a PEGASUS model with sampled gap-sentence ratios on both C4 and HugeNews, and stochastically sample important sentences. The updated results are reported in the table below (each cell is ROUGE-1/ROUGE-2/ROUGE-L).
| Dataset | C4 | HugeNews | Mixed & Stochastic |
|---|---|---|---|
| xsum | 45.20/22.06/36.99 | 47.21/24.56/39.25 | 47.60/24.83/39.64 |
| cnn_dailymail | 43.90/21.20/40.76 | 44.17/21.47/41.11 | 44.16/21.56/41.30 |
| newsroom | 45.07/33.39/41.28 | 45.15/33.51/41.33 | 45.98/34.20/42.18 |
| multi_news | 46.74/17.95/24.26 | 47.52/18.72/24.91 | 47.65/18.75/24.95 |
| gigaword | 38.75/19.96/36.14 | 39.12/19.86/36.24 | 39.65/20.47/36.76 |
| wikihow | 43.07/19.70/34.79 | 41.35/18.51/33.42 | 46.39/22.12/38.41 * |
| reddit_tifu | 26.54/8.94/21.64 | 26.63/9.01/21.60 | 27.99/9.81/22.94 |
| big_patent | 53.63/33.16/42.25 | 53.41/32.89/42.07 | 52.29/33.08/41.66 * |
| arxiv | 44.70/17.27/25.80 | 44.67/17.18/25.73 | 44.21/16.95/25.67 |
| pubmed | 45.49/19.90/27.69 | 45.09/19.56/27.42 | 45.97/20.15/28.25 |
| aeslc | 37.69/21.85/36.84 | 37.40/21.22/36.45 | 37.68/21.25/36.51 |
| billsum | 57.20/39.56/45.80 | 57.31/40.19/45.82 | 59.67/41.58/47.59 |

Entries marked * are not directly comparable across columns; see the Important Note below.
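For reference, scores in this format can be reproduced with Google's rouge-score package; this is an assumption about tooling, as the authors' exact ROUGE configuration may differ:

```python
# Sketch: ROUGE-1/ROUGE-2/ROUGE-L F-measures for a single example,
# printed in the same R1/R2/RL format used in the table above.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "The cat sat on the mat.",        # reference summary
    "A cat was sitting on the mat.",  # model-generated summary
)
print("/".join(f"{scores[k].fmeasure * 100:.2f}" for k in ("rouge1", "rouge2", "rougeL")))
```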
The "Mixed & Stochastic" model has the following changes:
- Trained on both C4 and HugeNews (dataset mixture is weighted by their number of examples).
- Trained for 1.5M instead of 500k (we observe slower convergence on pretraining perplexity).
- The model uniformly samples a gap sentence ratio between 15% and 45%.
- Importance sentences are sampled using a 20% uniform noise to importance scores.
- The sentencepiece tokenizer is updated to be able to encode newline character.
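The sampling scheme above can be made concrete with a short sketch. This illustrates the stated scheme only and is not the authors' training code; the function and its inputs are hypothetical:

```python
# Schematic of the "Mixed & Stochastic" gap-sentence selection:
# sample a gap-sentence ratio uniformly in [15%, 45%], perturb the
# importance scores with 20% uniform noise, and mask the top sentences.
import random

def select_gap_sentences(sentences, importance_scores):
    """Return indices of sentences to mask as pre-training targets."""
    gsr = random.uniform(0.15, 0.45)                 # gap-sentence ratio per example
    noisy = [s * random.uniform(0.8, 1.2)            # "20% uniform noise" read here as a
             for s in importance_scores]             # +/-20% multiplicative perturbation
    k = max(1, round(gsr * len(sentences)))
    ranked = sorted(range(len(sentences)), key=lambda i: noisy[i], reverse=True)
    return sorted(ranked[:k])                        # keep document order
```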
Important Note
The wikihow and big_patent numbers are not comparable across columns because of changes in tokenization and data:
- The wikihow dataset contains newline characters, which are useful for paragraph segmentation; the C4 and HugeNews models' sentencepiece tokenizer doesn't encode newlines and therefore loses this information (a quick round-trip check is sketched below).
- We updated the BigPatent dataset to preserve casing and changed some of the format cleaning; please refer to the corresponding change in TFDS.
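Whether a given checkpoint's tokenizer preserves newlines can be checked with a round-trip test. A minimal sketch using the Hugging Face transformers API; the checkpoint name is just an example:

```python
# Round-trip a string containing a newline through the tokenizer to see
# whether paragraph boundaries survive encoding.
from transformers import PegasusTokenizer

tok = PegasusTokenizer.from_pretrained("google/pegasus-large")
text = "First paragraph.\nSecond paragraph."
ids = tok(text)["input_ids"]
# If the decoded text no longer contains "\n", this tokenizer does not
# encode newlines and paragraph structure is lost.
print(repr(tok.decode(ids, skip_special_tokens=True)))
```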
The "Mixed & Stochastic" model has the following changes (from pegasus - large in the paper):
- Trained on both C4 and HugeNews (dataset mixture is weighted by their number of examples).
- Trained for 1.5M instead of 500k (we observe slower convergence on pretraining perplexity).
- The model uniformly samples a gap sentence ratio between 15% and 45%.
- Importance sentences are sampled using a 20% uniform noise to importance scores.
- The sentencepiece tokenizer is updated to be able to encode newline character.
Citation
@misc{zhang2019pegasus,
title={PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization},
author={Jingqing Zhang and Yao Zhao and Mohammad Saleh and Peter J. Liu},
year={2019},
eprint={1912.08777},
archivePrefix={arXiv},
primaryClass={cs.CL}
}