PD-BERT Open-Source Paraphrase Detection Model - Free for Duplicate Content, Q&A, and Semantic Similarity Analysis

PD-BERT

Developed by viswadarshan06

A BERT-base fine-tuned model for paraphrase detection, suitable for duplicate content identification, Q&A systems, and semantic similarity analysis.

Text Classification

Transformers

Language: English
License: MIT (open source)
Tags: #High-recall paraphrase identification #Multi-dataset fusion #Semantic similarity analysis

Downloads 23

Release Date: 2/9/2025

Model Overview

This model, fine-tuned from BERT-base, identifies paraphrase relationships between sentence pairs. It reaches between 84.87% and 92.60% test accuracy across four benchmark datasets (MRPC, QQP, PAWS-X, and PIT) and is tuned toward recall, which helps it catch paraphrases within complex sentence structures.

Model Features

Multi-dataset Training

Combines four benchmark datasets (MRPC, QQP, PAWS-X, and PIT) covering various paraphrase scenarios including news, Q&A, and adversarial testing.

High-Recall Design

Optimized model structure prioritizes recall capability for paraphrase relationships, making it ideal for applications requiring high coverage.

Strong Domain Adaptability

The base model is trained on diverse domain data and can be quickly fine-tuned for specialized fields like healthcare and law.

Model Capabilities

Sentence pair semantic similarity analysis

Duplicate question detection

Text deduplication

Q&A system enhancement

Use Cases

Customer Support

FAQ Deduplication

Automatically identifies duplicate questions in user query databases

May reduce manual review workload by roughly 30% (an estimate inferred from the paper, not a measured result)

Content Management

News Aggregation

Identifies duplicate news reports from different sources

Achieves 84.87% accuracy on the MRPC test set

🚀 Model Card for Fine-Tuned BERT for Paraphrase Detection

This is a fine-tuned BERT-base model for paraphrase detection. It's trained on four benchmark datasets (MRPC, QQP, PAWS-X, and PIT). The model is useful for applications like duplicate content detection, question answering, and semantic similarity analysis, with strong recall capabilities for identifying paraphrases in complex sentence structures.

🚀 Quick Start

To use the model, install transformers together with PyTorch (the example returns PyTorch tensors) and load the fine-tuned model as follows:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model
model_path = "viswadarshan06/pd-bert"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Encode the two sentences as a single sentence-pair input
inputs = tokenizer("The car is fast.", "The vehicle moves quickly.", return_tensors="pt", padding=True, truncation=True)

# Get predictions without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
# Take the highest-scoring class along the label dimension (1 = paraphrase)
predicted_class = logits.argmax(dim=-1).item()
print("Paraphrase" if predicted_class == 1 else "Not a Paraphrase")

✨ Features

Direct Use:
- Identify duplicate questions in customer support and FAQs.
- Improve semantic search in retrieval-based systems.
- Enhance document deduplication and text similarity applications.

Downstream Use:
- Fine-tune further on domain-specific paraphrase datasets for industries such as healthcare, legal, and finance.

📦 Installation

To use the model, you need the transformers library and PyTorch, since the Quick Start example requests PyTorch tensors. You can install both via pip:

pip install transformers torch

📚 Documentation

Model Description

Developed by: Viswadarshan R R
Model Type: Transformer-based Sentence Pair Classifier
Language: English
Fine-tuned from: bert-base-cased

Model Sources

Repository: Hugging Face Model Hub
Research Paper: Comparative Insights into Modern Architectures for Paraphrase Detection (Accepted at ICCIDS 2025)
Demo: (To be added upon deployment)

Uses

Direct Use

Identifying duplicate questions in customer support and FAQs.
Improving semantic search in retrieval-based systems.
Enhancing document deduplication and text similarity applications.
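
As an illustration of the deduplication use case, the following is a minimal sketch of a greedy deduplication pass. The function name `deduplicate` and the `is_duplicate` callback are hypothetical and not part of the model or the transformers library. In a real pipeline the callback would wrap a call to the classifier from the Quick Start; here a trivial case-insensitive comparison stands in so that the sketch runs without downloading the model.

```python
from typing import Callable, List


def deduplicate(questions: List[str],
                is_duplicate: Callable[[str, str], bool]) -> List[str]:
    """Keep the first occurrence of each question, dropping later duplicates.

    `is_duplicate` is a hypothetical callback; in practice it could wrap the
    paraphrase classifier and return True when the predicted label is 1.
    """
    kept: List[str] = []
    for question in questions:
        # Compare against every question kept so far (quadratic in the worst case).
        if not any(is_duplicate(question, earlier) for earlier in kept):
            kept.append(question)
    return kept


def same_ignoring_case(a: str, b: str) -> bool:
    """Stand-in predicate used only to make this sketch self-contained."""
    return a.strip().lower() == b.strip().lower()


faq = [
    "How do I reset my password?",
    "how do i reset my password? ",
    "Where is my order?",
]
unique_faq = deduplicate(faq, same_ignoring_case)
print(unique_faq)
```

Because every new question is compared with all kept questions, the number of model calls grows quadratically with the size of the database. For large collections, a cheaper candidate-retrieval step before the classifier is a common design choice.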

Downstream Use

This model can be further fine-tuned on domain-specific paraphrase datasets for industries such as healthcare, legal, and finance.

Out-of-Scope Use

The model is monolingual and trained only on English datasets, requiring additional fine-tuning for multilingual tasks.
May struggle with idiomatic expressions or complex figurative language.

Bias, Risks, and Limitations

Known Limitations

Higher recall but lower precision: The model tends to over-identify paraphrases, leading to increased false positives.
Contextual ambiguity: May misinterpret sentences that require deep contextual reasoning.

Recommendations

Users can mitigate the false positive rate by applying post-processing techniques or confidence threshold tuning.
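
A minimal sketch of confidence threshold tuning, assuming the model outputs two logits per sentence pair with index 1 meaning "paraphrase" (as in the Quick Start example). The helper names and the 0.8 threshold are illustrative assumptions, not values from the model card; a suitable threshold would need to be chosen on validation data.

```python
import math
from typing import Sequence


def paraphrase_probability(logits: Sequence[float]) -> float:
    """Softmax probability of class 1 (paraphrase) from two raw logits."""
    not_paraphrase, paraphrase = logits
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(not_paraphrase, paraphrase)
    exp_not = math.exp(not_paraphrase - peak)
    exp_par = math.exp(paraphrase - peak)
    return exp_par / (exp_not + exp_par)


def is_paraphrase(logits: Sequence[float], threshold: float = 0.5) -> bool:
    """Predict 'paraphrase' only when its probability reaches the threshold."""
    return paraphrase_probability(logits) >= threshold


# Hypothetical logits for one sentence pair: [not-paraphrase, paraphrase].
example_logits = [0.0, 1.0]
print(paraphrase_probability(example_logits))
print(is_paraphrase(example_logits, threshold=0.5))
print(is_paraphrase(example_logits, threshold=0.8))
```

With the real model, the logits for one pair could be obtained as `outputs.logits[0].tolist()`. Raising the threshold above 0.5 trades recall for precision, which targets the false positives described under Known Limitations.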

🔧 Technical Details

Training Details

This model was trained using a combination of four datasets:

MRPC: News-based paraphrases.
QQP: Duplicate question detection.
PAWS-X: Adversarial paraphrases for robustness testing.
PIT: Short-text paraphrase dataset.

Training Procedure

Tokenizer: BERT Tokenizer
Batch Size: 16
Optimizer: AdamW
Loss Function: Cross-entropy

Training Hyperparameters

Learning Rate: 2e-5
Sequence Length:
- MRPC: 256
- QQP: 336
- PIT: 64
- PAWS-X: 256
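
The settings listed above can be collected in a single configuration object. The sketch below is one possible arrangement, not the authors' training script: the class name `FineTuneConfig` and the `tokenizer_kwargs` helper are hypothetical, while the values are taken from this model card.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters reported in this model card (container is hypothetical)."""
    base_model: str = "bert-base-cased"
    batch_size: int = 16
    learning_rate: float = 2e-5
    optimizer: str = "AdamW"
    loss: str = "cross-entropy"
    # Maximum sequence length per dataset, in tokens.
    max_seq_length: Dict[str, int] = field(
        default_factory=lambda: {"MRPC": 256, "QQP": 336, "PIT": 64, "PAWS-X": 256}
    )

    def tokenizer_kwargs(self, dataset: str) -> dict:
        """Keyword arguments that could be passed to a tokenizer call."""
        return {"truncation": True, "max_length": self.max_seq_length[dataset]}


config = FineTuneConfig()
print(config.tokenizer_kwargs("QQP"))
```

`truncation` and `max_length` are standard arguments of Hugging Face tokenizer calls; whether the original training also padded to the maximum length is not stated in the card, so padding is left out here.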

Speeds, Sizes, Times

GPU Used: NVIDIA A100
Total Training Time: ~6 hours
Compute Units Used: 80

Testing Data, Factors & Metrics

Testing Data

The model was tested on combined test sets and evaluated using:

Accuracy
Precision
Recall
F1-Score
Runtime

Results

BERT Model Evaluation Metrics

Model	Dataset	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	Runtime (sec)
BERT	MRPC Validation	88.24	88.37	95.34	91.72	1.41
BERT	MRPC Test	84.87	85.84	92.50	89.04	5.77
BERT	QQP Validation	87.92	81.44	86.86	84.06	43.24
BERT	QQP Test	88.14	82.49	86.56	84.47	43.51
BERT	PAWS-X Validation	91.90	87.57	94.67	90.98	6.73
BERT	PAWS-X Test	92.60	88.69	95.92	92.16	6.82
BERT	PIT Validation	77.38	72.41	58.57	64.76	4.34
BERT	PIT Test	86.16	64.11	76.57	69.79	0.98
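
The F1-Score column is the harmonic mean of precision and recall, so it can be sanity-checked from the other two columns. The sketch below shows that calculation and the usual definitions from confusion-matrix counts. The function names are illustrative, and the counts in the second example are made up for demonstration, not taken from the evaluation. Because the table values are rounded to two decimals, recomputed F1 scores may differ from the reported ones in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 as fractions from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Precision and recall (%) for MRPC Validation, as reported in the table.
mrpc_val_f1 = f1_score(88.37, 95.34)
print(round(mrpc_val_f1, 2))

# Made-up counts, for illustration only.
toy_metrics = metrics_from_counts(tp=8, fp=2, fn=2, tn=8)
print(toy_metrics)
```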

Summary

This BERT-based paraphrase detection model shows strong recall on MRPC, QQP, and PAWS-X (above 86% on every validation and test split), which makes it effective at identifying paraphrases across varied linguistic structures. Recall is weaker on PIT (58.57% on validation, 76.57% on test). Because the model tends to overpredict paraphrases, precision is generally lower than recall, but it remains a strong baseline for semantic similarity tasks and can be fine-tuned further for domain-specific applications.

Citation

If you use this model, please cite:

@inproceedings{viswadarshan2025paraphrase,
   title={Comparative Insights into Modern Architectures for Paraphrase Detection},
   author={Viswadarshan R R and Viswaa Selvam S and Felcia Lilian J and Mahalakshmi S},
   booktitle={International Conference on Computational Intelligence, Data Science, and Security (ICCIDS)},
   year={2025},
   publisher={IFIP AICT Series by Springer}
}

📄 License

This project is licensed under the MIT license.

Model Card Contact

📧 Email: viswadarshanrramiya@gmail.com
🔗 GitHub: Viswadarshan R R
