# 🚀 Pegasus Large Privacy Policy Summarization V2

A fine-tuned Google PEGASUS Large model for summarizing privacy policy documents.
## 🚀 Quick Start

Use the code below to get started with the model.
```python
import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# Run on GPU when available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_checkpoint = "AryehRotberg/Pegasus-Large-Privacy-Policy-Summarization-V2"
model = PegasusForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
tokenizer = PegasusTokenizer.from_pretrained(model_checkpoint)

def summarize(text):
    # Long policies are truncated to the model's 1024-token input limit.
    inputs = tokenizer(
        f"Summarize the following document: {text}\nSummary: ",
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    ).to(device)
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
## ✨ Features

- Transformer-based Summarization: A Transformer-based abstractive summarization model for privacy policy documents.
- Fine-tuned on Specific Data: Fine-tuned on a curated dataset of privacy policy documents and their summaries.
- Multiple Use Cases: Suitable for direct summarization and can be further fine-tuned for domain-specific tasks.
## 📦 Installation

The code snippets below assume you have the `torch` and `transformers` libraries installed. If not, you can install them with:

```bash
pip install torch
pip install transformers
```
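To verify the environment, you can import both libraries and print their versions. The card does not pin specific versions, so any reasonably recent releases should work:

```python
import torch
import transformers

# Confirm both libraries import cleanly and report their versions.
print(f"torch {torch.__version__}, transformers {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
```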
## 💻 Usage Examples

### Basic Usage
```python
import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_checkpoint = "AryehRotberg/Pegasus-Large-Privacy-Policy-Summarization-V2"
model = PegasusForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
tokenizer = PegasusTokenizer.from_pretrained(model_checkpoint)

def summarize(text):
    inputs = tokenizer(
        f"Summarize the following document: {text}\nSummary: ",
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    ).to(device)
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

privacy_policy_text = "Your long privacy policy text here..."
summary = summarize(privacy_policy_text)
print(summary)
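```

By default, `model.generate` uses the generation settings bundled with the checkpoint. If you want tighter control over summary length and search strategy, you can pass generation arguments explicitly inside `summarize`. The values below are illustrative assumptions, not settings documented for this model:

```python
outputs = model.generate(
    **inputs,
    num_beams=4,          # beam search width
    max_length=256,       # cap on generated summary length, in tokens
    length_penalty=0.8,   # values < 1.0 favor shorter summaries
    early_stopping=True,  # stop once all beams have finished
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```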
### Advanced Usage

```python
from transformers import (
    PegasusForConditionalGeneration,
    PegasusTokenizer,
    Trainer,
    TrainingArguments,
)

model_checkpoint = "AryehRotberg/Pegasus-Large-Privacy-Policy-Summarization-V2"
model = PegasusForConditionalGeneration.from_pretrained(model_checkpoint)
tokenizer = PegasusTokenizer.from_pretrained(model_checkpoint)

# Placeholders: supply tokenized datasets of (document, summary) pairs.
train_dataset = ...
val_dataset = ...

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model="rouge1",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    # metric_for_best_model="rouge1" also requires a compute_metrics
    # function that reports it; see the sketch below.
)

trainer.train()
```
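Note that `metric_for_best_model="rouge1"` only works if the trainer actually reports a `rouge1` metric, which requires a `compute_metrics` function. A minimal sketch using the `evaluate` library follows; the library choice is an assumption, since the card states only that ROUGE was used. With a plain `Trainer`, the predictions passed in are logits, so in practice this pairs with `Seq2SeqTrainer` and `predict_with_generate=True` (listed under Training Hyperparameters below), which passes generated token IDs instead:

```python
import numpy as np
import evaluate

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Labels use -100 for ignored positions; restore the pad token before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Returns rouge1 / rouge2 / rougeL / rougeLsum scores.
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)
```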
## 📚 Documentation

### Model Details

| Property | Details |
|----------|---------|
| Model Type | Transformer-based abstractive summarization model |
| Architecture | Google PEGASUS Large |
| Fine-tuning Dataset | A curated dataset of privacy policy documents and their corresponding summaries |
| Intended Use | Summarizing long and complex privacy policies into concise, readable summaries |
| Limitations | May miss critical nuances, legal jargon, or context-dependent details in privacy policies |
### Uses

#### Direct Use

This model can be used to summarize lengthy privacy policy documents into concise summaries. It is designed for applications that require automated document summarization, such as compliance analysis and legal document processing.

#### Downstream Use

The model can be fine-tuned further for domain-specific summarization tasks involving legal, business, or government policy documents.

#### Out-of-Scope Use

- Legal Advice: The model is not a replacement for professional legal consultation.
- Summarization of Non-Privacy-Related Texts: Performance may degrade on general texts outside privacy policies.
- High-Stakes Decision-Making: The model should not be used for critical legal or compliance decisions without human oversight.
### Bias, Risks, and Limitations

#### Risks

- Summarization Bias: The model may overemphasize certain parts of a privacy policy while omitting crucial information.
- Misinterpretation: Legal terms may not be accurately rendered in plain-language summaries.
- Data Sensitivity: Summaries can be misleading if the model is applied to incomplete or biased inputs.

#### Recommendations

⚠️ **Important Note:** Human verification of summaries is advised, especially for legal and compliance use cases. Users, both direct and downstream, should be made aware of the model's risks, biases, and limitations, including potential biases in the training data.
## 🔧 Technical Details

### Training and Evaluation Data

The documents and summaries were extracted from the ToS;DR website's API. Only comprehensively reviewed website documents with a rating were used.
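For orientation, a rough sketch of pulling a service's reviewed documents over HTTP is shown below. The endpoint and response schema here are hypothetical placeholders, not documented by this card; consult the ToS;DR API documentation for the real interface:

```python
import requests

# Hypothetical endpoint and parameters -- check the ToS;DR API docs for the real ones.
API_URL = "https://api.tosdr.org/..."  # placeholder

def fetch_service(service_id: int) -> dict:
    """Fetch one service's documents and summaries as JSON."""
    response = requests.get(API_URL, params={"id": service_id}, timeout=30)
    response.raise_for_status()
    return response.json()
```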
### Training Procedure

#### Preprocessing

The TextRank algorithm was used to extract the top-n sentences from both documents and summaries, with a maximum of 30 sentences per document and 20 per summary. The BeautifulSoup library was used to parse HTML, and regular expressions were applied to remove excess whitespace. The dataset was then split into training and validation sets with a test size of 0.2 and a random seed of 42.
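A rough illustration of that pipeline follows. The card names the techniques but not the libraries, so `sumy` for TextRank and scikit-learn for the split are assumptions here:

```python
import re

from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.text_rank import TextRankSummarizer

def clean_html(html: str) -> str:
    # Strip HTML tags, then collapse runs of whitespace.
    text = BeautifulSoup(html, "html.parser").get_text(" ")
    return re.sub(r"\s+", " ", text).strip()

def textrank_top_sentences(text: str, max_sentences: int) -> str:
    # Keep the top-ranked sentences (30 for documents, 20 for summaries).
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    top = TextRankSummarizer()(parser.document, max_sentences)
    return " ".join(str(sentence) for sentence in top)

# pairs: list of (document_html, summary_html) tuples from the raw dataset.
pairs = [...]
processed = [
    (textrank_top_sentences(clean_html(doc), 30),
     textrank_top_sentences(clean_html(summ), 20))
    for doc, summ in pairs
]
train_pairs, val_pairs = train_test_split(processed, test_size=0.2, random_state=42)
```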
#### Training Hyperparameters

- Epochs: 10
- Weight decay: 0.01
- Batch size: 2 (train and eval)
- Logging steps: 10
- Warmup steps: 500
- Evaluation strategy: epoch
- Save strategy: epoch
- Metric for best model: ROUGE-1
- Load best model at end: True
- Prediction mode: predict_with_generate=True
- Optimizer: Adam with learning rate 0.001
- Scheduler: linear with warmup (num_warmup_steps=500, num_training_steps=1500)
- Reporting: MLflow
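These settings map fairly directly onto `Seq2SeqTrainingArguments`. The sketch below is a reasonable reading of the list above rather than the exact training script; argument spellings match transformers 4.x, where `evaluation_strategy` is the accepted name:

```python
from torch.optim import Adam
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    get_linear_schedule_with_warmup,
)

# Adam at lr 0.001 with a linear warmup schedule, as listed above.
optimizer = Adam(model.parameters(), lr=0.001)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=1500
)

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    num_train_epochs=10,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    weight_decay=0.01,
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model="rouge1",
    load_best_model_at_end=True,
    predict_with_generate=True,
    report_to="mlflow",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,    # ROUGE sketch from Advanced Usage
    optimizers=(optimizer, scheduler),  # override the default optimizer
)
```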
### Evaluation

#### Metrics

ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) were used to measure summarization quality.

#### Results

| Metric | Value |
|--------|-------|
| ROUGE-1 | 0.5142 |
| ROUGE-2 | 0.2896 |
| ROUGE-L | 0.2776 |
| ROUGE-Lsum | 0.2777 |
## 📄 License

This project is licensed under the MIT license.