# 🚀 Pegasus Large Privacy Policy Summarization V2

A fine-tuned Google PEGASUS Large model for summarizing privacy policy documents.
## 🚀 Quick Start

Use the code below to get started with the model.
```python
import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# Run on GPU when available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_checkpoint = "AryehRotberg/Pegasus-Large-Privacy-Policy-Summarization-V2"
model = PegasusForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
tokenizer = PegasusTokenizer.from_pretrained(model_checkpoint)

def summarize(text):
    # Long policies are truncated to the model's 1024-token input limit.
    inputs = tokenizer(
        f"Summarize the following document: {text}\nSummary: ",
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    ).to(device)
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
## ✨ Features

- Transformer-based Summarization: A Transformer-based abstractive summarization model for privacy policy documents.
- Fine-tuned on Specific Data: Fine-tuned on a curated dataset of privacy policy documents and their summaries.
- Multiple Use Cases: Suitable for direct summarization and can be further fine-tuned for domain-specific tasks.
## 📦 Installation

The code snippets below assume you have the `torch` and `transformers` libraries installed. If not, you can install them with:

```bash
pip install torch
pip install transformers
```
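To verify the environment, you can import both libraries and print their versions. The card does not pin specific versions, so any reasonably recent releases should work:

```python
import torch
import transformers

# Confirm both libraries import cleanly and report their versions.
print(f"torch {torch.__version__}, transformers {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
```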
## 💻 Usage Examples

### Basic Usage
```python
import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_checkpoint = "AryehRotberg/Pegasus-Large-Privacy-Policy-Summarization-V2"
model = PegasusForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
tokenizer = PegasusTokenizer.from_pretrained(model_checkpoint)

def summarize(text):
    inputs = tokenizer(
        f"Summarize the following document: {text}\nSummary: ",
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    ).to(device)
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

privacy_policy_text = "Your long privacy policy text here..."
summary = summarize(privacy_policy_text)
print(summary)
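```

By default, `model.generate` uses the generation settings bundled with the checkpoint. If you want tighter control over summary length and search strategy, you can pass generation arguments explicitly inside `summarize`. The values below are illustrative assumptions, not settings documented for this model:

```python
outputs = model.generate(
    **inputs,
    num_beams=4,          # beam search width
    max_length=256,       # cap on generated summary length, in tokens
    length_penalty=0.8,   # values < 1.0 favor shorter summaries
    early_stopping=True,  # stop once all beams have finished
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```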
### Advanced Usage

```python
from transformers import (
    PegasusForConditionalGeneration,
    PegasusTokenizer,
    Trainer,
    TrainingArguments,
)

model_checkpoint = "AryehRotberg/Pegasus-Large-Privacy-Policy-Summarization-V2"
model = PegasusForConditionalGeneration.from_pretrained(model_checkpoint)
tokenizer = PegasusTokenizer.from_pretrained(model_checkpoint)

# Placeholders: supply tokenized datasets of (document, summary) pairs.
train_dataset = ...
val_dataset = ...

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model="rouge1",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    # metric_for_best_model="rouge1" also requires a compute_metrics
    # function that reports it; see the sketch below.
)

trainer.train()
```
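Note that `metric_for_best_model="rouge1"` only works if the trainer actually reports a `rouge1` metric, which requires a `compute_metrics` function. A minimal sketch using the `evaluate` library follows; the library choice is an assumption, since the card states only that ROUGE was used. With a plain `Trainer`, the predictions passed in are logits, so in practice this pairs with `Seq2SeqTrainer` and `predict_with_generate=True` (listed under Training Hyperparameters below), which passes generated token IDs instead:

```python
import numpy as np
import evaluate

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Labels use -100 for ignored positions; restore the pad token before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Returns rouge1 / rouge2 / rougeL / rougeLsum scores.
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)
```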
## 📚 Documentation

### Model Details

| Property | Details |
|----------|---------|
| Model Type | Transformer-based abstractive summarization model |
| Architecture | Google PEGASUS Large |
| Fine-tuning Dataset | A curated dataset of privacy policy documents and their corresponding summaries |
| Intended Use | Summarizing long and complex privacy policies into concise, readable summaries |
| Limitations | May miss critical nuances, legal jargon, or context-dependent details in privacy policies |
### Uses

#### Direct Use

This model can be used to summarize lengthy privacy policy documents into concise summaries. It is designed for applications that require automated document summarization, such as compliance analysis and legal document processing.

#### Downstream Use

The model can be fine-tuned further for domain-specific summarization tasks involving legal, business, or government policy documents.

#### Out-of-Scope Use

- Legal Advice: The model is not a replacement for professional legal consultation.
- Summarization of Non-Privacy-Related Texts: Performance may degrade on general texts outside privacy policies.
- High-Stakes Decision-Making: The model should not be used for critical legal or compliance decisions without human oversight.
### Bias, Risks, and Limitations

#### Risks

- Summarization Bias: The model may overemphasize certain parts of a privacy policy while omitting crucial information.
- Misinterpretation: Legal terms may not be accurately rendered in plain-language summaries.
- Data Sensitivity: Summaries can be misleading if the model is applied to incomplete or biased inputs.

#### Recommendations

⚠️ **Important Note:** Human verification of summaries is advised, especially for legal and compliance use cases. Users, both direct and downstream, should be made aware of the model's risks, biases, and limitations, including potential biases in the training data.
## 🔧 Technical Details

### Training and Evaluation Data

The documents and summaries were extracted from the ToS;DR website's API. Only comprehensively reviewed website documents with a rating were used.
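For orientation, a rough sketch of pulling a service's reviewed documents over HTTP is shown below. The endpoint and response schema here are hypothetical placeholders, not documented by this card; consult the ToS;DR API documentation for the real interface:

```python
import requests

# Hypothetical endpoint and parameters -- check the ToS;DR API docs for the real ones.
API_URL = "https://api.tosdr.org/..."  # placeholder

def fetch_service(service_id: int) -> dict:
    """Fetch one service's documents and summaries as JSON."""
    response = requests.get(API_URL, params={"id": service_id}, timeout=30)
    response.raise_for_status()
    return response.json()
```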
### Training Procedure

#### Preprocessing

The TextRank algorithm was used to extract the top-n sentences from both documents and summaries, with a maximum of 30 sentences per document and 20 per summary. The BeautifulSoup library was used to parse HTML, and regular expressions were applied to remove excess whitespace. The dataset was then split into training and validation sets with a test size of 0.2 and a random seed of 42.
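A rough illustration of that pipeline follows. The card names the techniques but not the libraries, so `sumy` for TextRank and scikit-learn for the split are assumptions here:

```python
import re

from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.text_rank import TextRankSummarizer

def clean_html(html: str) -> str:
    # Strip HTML tags, then collapse runs of whitespace.
    text = BeautifulSoup(html, "html.parser").get_text(" ")
    return re.sub(r"\s+", " ", text).strip()

def textrank_top_sentences(text: str, max_sentences: int) -> str:
    # Keep the top-ranked sentences (30 for documents, 20 for summaries).
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    top = TextRankSummarizer()(parser.document, max_sentences)
    return " ".join(str(sentence) for sentence in top)

# pairs: list of (document_html, summary_html) tuples from the raw dataset.
pairs = [...]
processed = [
    (textrank_top_sentences(clean_html(doc), 30),
     textrank_top_sentences(clean_html(summ), 20))
    for doc, summ in pairs
]
train_pairs, val_pairs = train_test_split(processed, test_size=0.2, random_state=42)
```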
#### Training Hyperparameters

- Epochs: 10
- Weight decay: 0.01
- Batch size: 2 (train and eval)
- Logging steps: 10
- Warmup steps: 500
- Evaluation strategy: epoch
- Save strategy: epoch
- Metric for best model: ROUGE-1
- Load best model at end: True
- Prediction mode: predict_with_generate=True
- Optimizer: Adam with learning rate 0.001
- Scheduler: linear with warmup (num_warmup_steps=500, num_training_steps=1500)
- Reporting: MLflow
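These settings map fairly directly onto `Seq2SeqTrainingArguments`. The sketch below is a reasonable reading of the list above rather than the exact training script; argument spellings match transformers 4.x, where `evaluation_strategy` is the accepted name:

```python
from torch.optim import Adam
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    get_linear_schedule_with_warmup,
)

# Adam at lr 0.001 with a linear warmup schedule, as listed above.
optimizer = Adam(model.parameters(), lr=0.001)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=1500
)

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    num_train_epochs=10,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    weight_decay=0.01,
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model="rouge1",
    load_best_model_at_end=True,
    predict_with_generate=True,
    report_to="mlflow",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,    # ROUGE sketch from Advanced Usage
    optimizers=(optimizer, scheduler),  # override the default optimizer
)
```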
### Evaluation

#### Metrics

ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) were used to measure summarization quality.

#### Results

| Metric | Value |
|--------|-------|
| ROUGE-1 | 0.5142 |
| ROUGE-2 | 0.2896 |
| ROUGE-L | 0.2776 |
| ROUGE-Lsum | 0.2777 |
## 📄 License

This project is licensed under the MIT license.