🚀 PII-RANHA: Privacy-Preserving Token Classification Model
PII-RANHA is a fine-tuned token classification model based on ModernBERT-base from Answer.AI. It identifies and classifies Personally Identifiable Information (PII) in text. Trained on the ai4privacy/pii-masking-400k dataset, it detects 17 PII categories, including account numbers, credit card numbers, and email addresses. The model is designed for privacy-preserving applications such as data anonymization, redaction, and compliance with data protection regulations.
✨ Features
- Fine-tuned on the ai4privacy/pii-masking-400k dataset.
- Capable of detecting 17 different PII categories.
- Suitable for various privacy-preserving applications.
📦 Installation
To use the model, make sure the transformers and datasets libraries are installed:
pip install transformers datasets
💻 Usage Examples
Basic Usage
Here’s how to load and use the model for PII detection:
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
model_name = "scampion/piiranha"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Build a token-classification pipeline around them
pii_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer)

text = "My email is john.doe@example.com and my phone number is 555-123-4567."
results = pii_pipeline(text)

# Each result describes one predicted token
for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.4f}")
Sample Output
Entity: Ġj, Label: I-ACCOUNTNUM, Score: 0.6445
Entity: ohn, Label: I-ACCOUNTNUM, Score: 0.3657
Entity: ., Label: I-USERNAME, Score: 0.5871
Entity: do, Label: I-USERNAME, Score: 0.5350
Entity: Ġ555, Label: I-ACCOUNTNUM, Score: 0.8399
Entity: -, Label: I-SOCIALNUM, Score: 0.5948
Entity: 123, Label: I-SOCIALNUM, Score: 0.6309
Entity: -, Label: I-SOCIALNUM, Score: 0.6151
Entity: 45, Label: I-SOCIALNUM, Score: 0.3742
Entity: 67, Label: I-TELEPHONENUM, Score: 0.3440
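The sample output above labels individual sub-word tokens, some with low scores. For redaction, one option is to drop low-confidence predictions, merge adjacent spans, and mask what remains. The following is a minimal sketch, assuming each prediction also carries character offsets under `start` and `end` keys (the transformers token-classification pipeline provides these when a fast tokenizer is used). The `redact` helper, the 0.5 threshold, the `[REDACTED]` mask, and the example predictions are hypothetical, chosen for illustration and not part of the model.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict], min_score: float = 0.5,
           mask: str = "[REDACTED]") -> str:
    """Replace every sufficiently confident predicted span in `text` with `mask`."""
    # Keep confident predictions only, ordered by start offset.
    spans = sorted(
        (e["start"], e["end"]) for e in entities if e["score"] >= min_score
    )

    # Merge spans that touch or overlap so sub-word pieces become one span.
    merged: List[List[int]] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    # Replace from right to left so earlier offsets stay valid.
    for start, end in reversed(merged):
        text = text[:start] + mask + text[end:]
    return text


# Hypothetical predictions, hand-written for illustration (not real model output).
text = "Call 555-123-4567 now"
entities = [
    {"entity": "I-TELEPHONENUM", "score": 0.91, "start": 5, "end": 8},
    {"entity": "I-TELEPHONENUM", "score": 0.88, "start": 8, "end": 12},
    {"entity": "I-TELEPHONENUM", "score": 0.77, "start": 12, "end": 17},
    {"entity": "I-USERNAME", "score": 0.20, "start": 0, "end": 4},
]
print(redact(text, entities))  # Call [REDACTED] now
```

Merging here ignores labels, so neighbouring spans of different PII types collapse into a single mask. If per-category placeholders are needed, the label would have to be carried through the merge step. The pipeline's `aggregation_strategy` argument is another way to group sub-word pieces before post-processing.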
📚 Documentation
Model Details
| Property | Details |
| --- | --- |
| Model Type | Token Classification |
| Base Model | answerdotai/ModernBERT-base |
| Number of Labels | 18 (17 PII categories + "O" for non-PII tokens) |
Training Details
Dataset
The model was trained on the ai4privacy/pii-masking-400k dataset, which contains 400,000 examples of text with annotated PII tokens.
Training Configuration
- Batch Size: 32
- Learning Rate: 5e-5
- Epochs: 4
- Optimizer: AdamW
- Weight Decay: 0.01
- Scheduler: Linear learning rate scheduler
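As a rough illustration of the linear scheduler listed above, the sketch below computes the learning rate at a given optimizer step, assuming decay from the 5e-5 base rate to zero. The warmup length and total step count are not stated in this card, so `warmup_steps`, `total_steps`, and the `linear_lr` helper itself are assumptions made for illustration.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-5,
              warmup_steps: int = 0) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly so the rate reaches 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run length: 1,000 optimizer steps per epoch for 4 epochs.
total_steps = 4 * 1000
for step in (0, 1000, 2000, 3000, 4000):
    print(step, linear_lr(step, total_steps))
```

With no warmup, the rate at the end of each epoch is simply the base rate scaled by the fraction of training still remaining.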
Evaluation Metrics
The model was evaluated using the following metrics:
- Precision
- Recall
- F1 Score
- Accuracy
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.017100 | 0.017944 | 0.897562 | 0.905612 | 0.901569 | 0.993549 |
| 2 | 0.011300 | 0.014114 | 0.915451 | 0.923319 | 0.919368 | 0.994782 |
| 3 | 0.005000 | 0.015703 | 0.919432 | 0.928394 | 0.923892 | 0.995136 |
| 4 | 0.001000 | 0.022899 | 0.921234 | 0.927212 | 0.924213 | 0.995267 |
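F1 is the harmonic mean of precision and recall, so the reported F1 column can be cross-checked against the other two. A small sketch follows; the `f1` helper name is illustrative, and the numbers are copied from the table above.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# epoch -> (precision, recall, reported F1), taken from the evaluation table.
rows = {
    1: (0.897562, 0.905612, 0.901569),
    2: (0.915451, 0.923319, 0.919368),
    3: (0.919432, 0.928394, 0.923892),
    4: (0.921234, 0.927212, 0.924213),
}

for epoch, (p, r, reported) in rows.items():
    print(f"epoch {epoch}: recomputed F1 = {f1(p, r):.6f}, reported = {reported:.6f}")
```

The recomputed values agree with the reported column to within rounding of the six published digits.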
📄 License
This model is licensed under the Commons Clause Apache License 2.0. For more details, see the Commons Clause website. For other licensing options, contact the author.
Author
- Name: Sébastien Campion
- Email: sebastien.campion@foss4.eu
- Date: 2025-01-30
- Version: 0.1
Citation
If you use this model in your work, please cite it as follows:
@misc{piiranha2025,
  author = {Sébastien Campion},
  title = {PII-RANHA: A Privacy-Preserving Token Classification Model},
  year = {2025},
  version = {0.1},
  url = {https://huggingface.co/sebastien-campion/piiranha},
}
Disclaimer
This model is provided "as-is" without any guarantees of performance or suitability for specific use cases. Always evaluate the model's performance in your specific context before deployment.