Malicious URL Detection Model
A fine-tuned BERT-LoRA model for detecting malicious URLs, including phishing, malware, and defacement threats.
Quick Start
This is a fine-tuned BERT-based classifier for real-time detection of malicious URLs. It uses Low-Rank Adaptation (LoRA) for efficient fine-tuning, which reduces computational cost while maintaining high accuracy.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Placeholder: replace with the model's Hugging Face repository ID.
model_name = "your-huggingface-model-name"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

url = "http://example.com/login"
inputs = tokenizer(url, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Run inference without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring class along the label dimension.
prediction = torch.argmax(outputs.logits, dim=-1).item()
label_map = {0: "Benign", 1: "Defacement", 2: "Phishing", 3: "Malware"}
print(f"Prediction: {label_map[prediction]}")
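The snippet above reports only the top label. Where a confidence score is also useful, the raw logits can be passed through a softmax. The sketch below does this in plain Python so it runs without downloading the model; `predict_with_confidence` and the sample logits are illustrative assumptions, and the label order is copied from `label_map` above.

```python
import math

# Label order copied from the Quick Start snippet.
LABEL_MAP = {0: "Benign", 1: "Defacement", 2: "Phishing", 3: "Malware"}

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_confidence(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_MAP[best], probs[best]

# Hypothetical logits, e.g. outputs.logits[0].tolist() from the model.
label, confidence = predict_with_confidence([0.2, -1.3, 3.1, 0.4])
print(f"{label} ({confidence:.1%})")
```

With a real model, the list passed in would come from `outputs.logits[0].tolist()`.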
Features
- Classifies URLs into four categories: Benign, Defacement, Phishing, and Malware.
- Achieves 98% validation accuracy and an overall F1-score of 0.965.
Documentation
Intended Uses
Use Cases
- Real-time URL classification for cybersecurity tools.
- Phishing and malware detection for online safety.
- Integration into browser extensions for instant threat alerts.
- Security monitoring for Security Operations Centers (SOCs).
Model Details
| Property | Details |
|---|---|
| Model Type | BERT-based URL Classifier |
| Fine-Tuning Method | LoRA (Low-Rank Adaptation) |
| Base Model | bert-base-uncased |
| Number of Parameters | 110M |
| Dataset | Kaggle Malicious URLs Dataset (~651,191 samples) |
| Max Sequence Length | 128 |
| Framework | transformers, torch, peft |
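To give a feel for why LoRA is cheaper than full fine-tuning, the sketch below counts the extra trainable weights LoRA would add to a model the size of `bert-base-uncased`. The rank of 8 and the choice to adapt only the query and value projections are assumptions for illustration; the model card does not state the LoRA hyperparameters actually used. The count also ignores the classification head, which is normally trained as well.

```python
HIDDEN_SIZE = 768           # bert-base-uncased hidden dimension
NUM_LAYERS = 12             # bert-base-uncased encoder layers
BASE_PARAMS = 110_000_000   # approximate parameter count listed above

def lora_param_count(rank, adapted_per_layer=2):
    """Trainable weights added by LoRA.

    Each adapted d x d matrix gets two low-rank factors,
    A (rank x d) and B (d x rank), i.e. 2 * d * rank weights.
    """
    per_matrix = 2 * HIDDEN_SIZE * rank
    return NUM_LAYERS * adapted_per_layer * per_matrix

# Assumed rank; not taken from the model card.
added = lora_param_count(rank=8)
print(f"LoRA weights: {added:,} ({added / BASE_PARAMS:.2%} of the base model)")
```

Under these assumptions the adapter weights are well under one percent of the base model, which is where the savings in memory and optimizer state come from.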
Training Details
- Batch Size: 16
- Epochs: 5
- Learning Rate: 2e-5
- Optimizer: AdamW with weight decay
- Loss Function: Weighted Cross-Entropy
- Evaluation Strategy: Epoch-based
- Fine-Tuning Strategy: LoRA applied to BERT layers
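Weighted cross-entropy scales each sample's loss by a per-class weight so that rarer classes contribute more to the gradient. The sketch below shows the computation for a single sample in plain Python. The inverse-frequency weighting scheme and the class counts are illustrative assumptions, since the model card does not say how the weights were chosen.

```python
import math

def inverse_frequency_weights(class_counts):
    """Weight each class by total / (num_classes * count)."""
    total = sum(class_counts)
    n = len(class_counts)
    return [total / (n * c) for c in class_counts]

def weighted_cross_entropy(logits, target, weights):
    """Loss for one sample: -w[target] * log(softmax(logits)[target])."""
    peak = max(logits)  # log-sum-exp trick for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_prob = logits[target] - log_sum_exp
    return -weights[target] * log_prob

# Hypothetical class counts in the order Benign, Defacement, Phishing, Malware.
weights = inverse_frequency_weights([400, 100, 100, 50])
loss = weighted_cross_entropy([2.0, 0.5, 0.1, -1.0], target=3, weights=weights)
print(weights, round(loss, 3))
```

In PyTorch, weights like these would typically be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.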
Evaluation Results
| Metric | Value |
|---|---|
| Accuracy | 98% |
| Precision | 0.96 |
| Recall | 0.97 |
| F1 Score | 0.965 |
Category-wise Performance
| Category | Precision | Recall | F1-Score |
|---|---|---|---|
| Benign | 0.98 | 0.99 | 0.985 |
| Defacement | 0.98 | 0.99 | 0.985 |
| Phishing | 0.93 | 0.94 | 0.935 |
| Malware | 0.95 | 0.96 | 0.955 |
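The F1 values reported above follow from the precision and recall figures, since F1 is their harmonic mean. The short check below recomputes each one; the function and dictionary names are illustrative, and the numbers are copied from the results above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the reported results.
reported = {
    "Overall": (0.96, 0.97),
    "Benign": (0.98, 0.99),
    "Defacement": (0.98, 0.99),
    "Phishing": (0.93, 0.94),
    "Malware": (0.95, 0.96),
}

f1_by_category = {
    name: round(f1_score(p, r), 3) for name, (p, r) in reported.items()
}
for name, f1 in f1_by_category.items():
    print(f"{name:<10} F1 = {f1:.3f}")
```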
Deployment Options
Streamlit Web App
- Deployed on Streamlit Cloud, AWS, or Google Cloud.
- Provides real-time URL analysis with a user-friendly interface.
Browser Extension (Planned)
- Real-time scanning of visited web pages.
- Dynamic threat alerts with confidence scores.
API Integration
- REST API for bulk URL analysis.
- Supports Security Operations Centers (SOC).
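For bulk analysis, URLs are usually grouped into fixed-size batches before being sent to the model. The sketch below shows one way to do that. `classify_in_batches` and the stand-in `fake_classifier` are hypothetical names, not part of any published API, and the batch size of 16 simply mirrors the training batch size rather than a documented limit.

```python
from typing import Callable, Dict, Iterable, Iterator, List

def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch

def classify_in_batches(
    urls: Iterable[str],
    classify_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,
) -> Dict[str, str]:
    """Map each URL to a label, calling the classifier once per batch."""
    results: Dict[str, str] = {}
    for batch in batched(urls, batch_size):
        for url, label in zip(batch, classify_batch(batch)):
            results[url] = label
    return results

# Stand-in for a real model call; it labels every URL "Benign".
def fake_classifier(batch: List[str]) -> List[str]:
    return ["Benign"] * len(batch)

urls = [f"http://example.com/page{i}" for i in range(40)]
labels = classify_in_batches(urls, fake_classifier)
print(len(labels), "URLs classified")
```

In a real service, `fake_classifier` would be replaced by a function that tokenizes the batch and runs the model once per call.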
Limitations & Bias
Important Note
- The model may misclassify complex phishing URLs that mimic legitimate sites.
- It needs regular updates to counter evolving threats.
- There is potential bias if future threats are not represented in training data.
Training Data & Citation
Data Source
The dataset is sourced from the Kaggle Malicious URLs Dataset:
Dataset Link
BibTeX Citation
@article{maliciousurl2025,
  author = {Gleyzie Tongo and Farnaz Farid and Ala Al-Areqi and Farhad Ahamed},
  title = {Fine-Tuned BERT for Malicious URL Detection},
  year = {2025},
  institution = {Western Sydney University}
}
Contact
For inquiries, collaborations, or feedback, feel free to reach out via LinkedIn:
Gleyzie Tongo
License
This project is licensed under the Apache-2.0 license.