lsg-bart-base-16384-pubmed Open-source Text Summarization Model - Accurately Process Ultra-long PubMed Papers

Lsg Bart Base 16384 Pubmed

Developed by ccdv

A long-sequence text summarization model based on the BART architecture, specifically fine-tuned for the PubMed scientific paper dataset, capable of processing input sequences up to 16,384 tokens in length

Text Generation

Transformers

English#Long Text Summarization #Medical Paper Processing #16384 Long Sequence

Downloads 22

Release Time : 5/9/2022

Model Overview

This model employs a local-sparse-global attention mechanism for long-sequence text summarization tasks, particularly suitable for generating summaries of lengthy texts such as scientific papers

Model Features

Long Sequence Processing Capability

Capable of processing input sequences up to 16,384 tokens, making it particularly suitable for long document summarization

Efficient Attention Mechanism

Utilizes a local-sparse-global attention mechanism to enhance long-sequence processing efficiency while maintaining performance

Scientific Paper Optimization

Specifically fine-tuned for the PubMed scientific paper dataset, ideal for academic text summarization

Model Capabilities

Long Text Summarization Generation

Scientific Paper Content Extraction

English Text Processing

Use Cases

Academic Research

Automatic Summarization of Scientific Papers

Generates concise and accurate summaries for lengthy research papers

Achieves a ROUGE-1 score of 48.32 on the PubMed dataset

Literature Review Assistance

Helps researchers quickly grasp the core content of multiple papers

🚀 ccdv/lsg-bart-base-16384-pubmed

This model is a fine - tuned version of ccdv/lsg-bart-base-4096-pubmed on the scientific_papers pubmed dataset, capable of handling 16384 long sequences.

🚀 Quick Start

Prerequisites

Transformers >= 4.36.1 This model relies on a custom modeling file, you need to add trust_remote_code=True See #13467

LSG ArXiv paper. Github/conversion script is available at this link.

Code Example

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-bart-base-16384-pubmed", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ccdv/lsg-bart-base-16384-pubmed", trust_remote_code=True)

text = "Replace by what you want."
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device=0)
generated_text = pipe(
  text, 
  truncation=True, 
  max_length=64, 
  no_repeat_ngram_size=7,
  num_beams=2,
  early_stopping=True
  )

✨ Features

Long Sequence Handling: The model is converted to handle 16384 long sequences.
Fine - Tuned: It is a fine - tuned version of ccdv/lsg-bart-base-4096-pubmed on the scientific_papers pubmed dataset.

📚 Documentation

Performance on Test Set

Length	Global tokens	Fine-tuning	Block Size	Connexions	R1	R2	RL	RLsum
16384	64	Full	256	768	48.32	22.52	29.36	44.57
16384	1	Full	256	768	48.26	22.53	29.40	44.51
16384	64	Global only	256	768	48.12	20.46	29.34	44.40
16384	1	None	256	768	48.03	22.42	29.28	44.32

Reference Model Performance

Length	Global tokens	Fine-tuning	Block Size	Sparsity	Connexions	R1	R2	RL	RLsum
4096	1	-	256	0	768	47.37	21.74	28.59	43.67

🔧 Technical Details

Model Description

The model relies on Local - Sparse - Global attention to handle long sequences: attn

The model has about ~145 millions parameters (6 encoder layers - 6 decoder layers). The model is warm started from ccdv/lsg-bart-base-4096-pubmed, converted to handle long sequences (encoder only) and fine tuned.

Training Procedure

Training Hyperparameters

The following hyperparameters were used during training:

learning_rate: 8e-05
train_batch_size: 8
seed: 42
gradient_accumulation_steps: 4
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1.0

Generate Hyperparameters

The following hyperparameters were used during generation:

dataset_name: scientific_papers
dataset_config_name: pubmed
eval_batch_size: 4
eval_samples: 6658
early_stopping: True
ignore_pad_token_for_loss: True
length_penalty: 2.0
max_length: 512
min_length: 128
num_beams: 5
no_repeat_ngram_size: None
seed: 123

Framework Versions

Transformers 4.18.0
Pytorch 1.10.1+cu102
Datasets 2.1.0
Tokenizers 0.11.6

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご