# 🚀 Model Card for PII Detection with DeBERTa

This model is a fine-tuned version of DeBERTa for Named Entity Recognition (NER), specifically designed to detect Personally Identifiable Information (PII) in text.

## 🚀 Quick Start

This model is a fine-tuned version of microsoft/deberta for Named Entity Recognition (NER). It is specifically crafted to detect Personally Identifiable Information (PII) entities such as names, SSNs, phone numbers, credit card numbers, addresses, and more.
## ✨ Features

- Based on the powerful DeBERTa architecture.
- Specialized for PII detection through fine-tuning on a custom dataset.
- Performs sequence labeling using token-level classification.
## 📦 Installation

To use this model, install the `transformers` library (plus a backend such as PyTorch) using pip:

```bash
pip install transformers torch
```
## 💻 Usage Examples

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "AI-Enthusiast11/pii-entity-extractor"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

def merge_tokens(ner_results):
    """Group pipeline output by entity type, stitching subword pieces together."""
    entities = {}
    for entity in ner_results:
        entity_type = entity["entity_group"]
        entity_value = entity["word"].replace("##", "")
        if entity_type not in entities:
            entities[entity_type] = []
        if entities[entity_type] and not entity_value.startswith(" "):
            # Continuation of the previous piece: glue it onto the last value.
            entities[entity_type][-1] += entity_value
        else:
            entities[entity_type].append(entity_value)
    return entities

def redact_text_with_labels(text):
    """Replace every detected entity in `text` with its label, e.g. [NAME]."""
    ner_results = nlp(text)
    cleaned_entities = merge_tokens(ner_results)
    redacted_text = text
    for entity_type, values in cleaned_entities.items():
        for value in values:
            redacted_text = redacted_text.replace(value, f"[{entity_type}]")
    return redacted_text

# Reuse the objects loaded above instead of resolving the model name again.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

ner_results = nlp(example)
cleaned_entities = merge_tokens(ner_results)

print("\n==NER Results:==\n")
for entity_type, values in cleaned_entities.items():
    print(f"  {entity_type}: {', '.join(values)}")

redacted_example = redact_text_with_labels(example)
print(f"\n==Redacted Example:==\n{redacted_example}")
```
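To see how `merge_tokens` stitches subword fragments back together, here is a small self-contained check on hand-crafted, pipeline-style output (the dictionaries below are illustrative examples, not real model predictions):

```python
def merge_tokens(ner_results):
    # Same helper as above: group results by entity type and stitch subword pieces.
    entities = {}
    for entity in ner_results:
        entity_type = entity["entity_group"]
        entity_value = entity["word"].replace("##", "")
        if entity_type not in entities:
            entities[entity_type] = []
        if entities[entity_type] and not entity_value.startswith(" "):
            entities[entity_type][-1] += entity_value
        else:
            entities[entity_type].append(entity_value)
    return entities

# Hand-crafted sample: "##son" is a subword continuation of "John",
# while " Smith" (leading space) starts a new word.
sample = [
    {"entity_group": "NAME", "word": "John"},
    {"entity_group": "NAME", "word": "##son"},
    {"entity_group": "NAME", "word": " Smith"},
    {"entity_group": "PHONE-NO", "word": "727-814-3902"},
]

print(merge_tokens(sample))
# {'NAME': ['Johnson', ' Smith'], 'PHONE-NO': ['727-814-3902']}
```

Note that a value without a leading space is treated as a continuation of the previous piece, so entity boundaries rely on the tokenizer emitting leading spaces for new words.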
## 📚 Documentation

### Model Details

#### Model Description

This transformer-based model is fine-tuned on a custom dataset to detect sensitive information, commonly categorized as PII. The model performs sequence labeling to identify entities using token-level classification.

| Property | Details |
| --- | --- |
| Developed by | [Privatone] |
| Finetuned from model | microsoft/deberta |
| Model type | Token Classification (NER) |
| Language(s) | English |
| Use case | PII detection in text |
### Training Details

#### Training Data

The model was fine-tuned on a custom dataset containing labeled examples of the following PII entity types:

- NAME
- SSN
- PHONE-NO
- CREDIT-CARD-NO
- BANK-ACCOUNT-NO
- BANK-ROUTING-NO
- ADDRESS
#### Epoch Logs

| Epoch | Train Loss | Val Loss | Precision | Recall | F1 | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.3672 | 0.1987 | 0.7806 | 0.8114 | 0.7957 | 0.9534 |
| 2 | 0.1149 | 0.1011 | 0.9161 | 0.9772 | 0.9457 | 0.9797 |
| 3 | 0.0795 | 0.0889 | 0.9264 | 0.9825 | 0.9536 | 0.9813 |
| 4 | 0.0708 | 0.0880 | 0.9242 | 0.9842 | 0.9533 | 0.9806 |
| 5 | 0.0626 | 0.0858 | 0.9235 | 0.9851 | 0.9533 | 0.9806 |
#### SeqEval Classification Report

| Label | Precision | Recall | F1-score | Support |
| --- | --- | --- | --- | --- |
| ADDRESS | 0.91 | 0.94 | 0.92 | 77 |
| BANK-ACCOUNT-NO | 0.91 | 0.99 | 0.95 | 169 |
| BANK-ROUTING-NO | 0.85 | 0.96 | 0.90 | 104 |
| CREDIT-CARD-NO | 0.95 | 1.00 | 0.97 | 228 |
| NAME | 0.98 | 0.97 | 0.97 | 164 |
| PHONE-NO | 0.94 | 0.99 | 0.96 | 308 |
| SSN | 0.87 | 1.00 | 0.93 | 90 |

Summary:

- Micro avg: 0.95
- Macro avg: 0.95
- Weighted avg: 0.95
### Evaluation

#### Testing Data

Evaluation was done on a held-out portion of the same labeled dataset.

#### Metrics

- Precision
- Recall
- F1 (via seqeval)
- Entity-wise breakdown
- Token-level accuracy
#### Results

- Per-label F1-scores range from 0.90 to 0.97, with a micro-average of 0.95, showing robust PII detection across entity types.
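The entity-level precision, recall, and F1 that seqeval reports can be sketched in plain Python by comparing gold and predicted entities as `(label, start, end)` tuples, where an entity counts only on an exact match. This is a minimal illustration with made-up spans, not the card's actual evaluation code:

```python
def entity_prf(gold, pred):
    """Entity-level precision/recall/F1 over sets of (label, start, end) tuples."""
    tp = len(gold & pred)  # exact label-and-span matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up spans: one exact match, one missed entity, one spurious prediction.
gold = {("NAME", 0, 2), ("SSN", 5, 6)}
pred = {("NAME", 0, 2), ("PHONE-NO", 8, 9)}

print(entity_prf(gold, pred))
# (0.5, 0.5, 0.5)
```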
### Recommendations

⚠️ Important Note

- Use human review in high-risk environments.
- Evaluate on your own domain-specific data before deployment.
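One lightweight way to wire in human review is to route low-confidence detections to a reviewer instead of redacting them automatically. This is a sketch assuming pipeline output with `score` fields; the 0.90 threshold is an arbitrary illustration, not a tuned value:

```python
def split_by_confidence(ner_results, threshold=0.90):
    """Split detections into auto-redactable and needs-human-review buckets."""
    auto_redact, needs_review = [], []
    for entity in ner_results:
        if entity["score"] >= threshold:
            auto_redact.append(entity)
        else:
            needs_review.append(entity)
    return auto_redact, needs_review

# Illustrative pipeline-style output (not real model predictions).
sample = [
    {"entity_group": "NAME", "word": "Mia Thompson", "score": 0.99},
    {"entity_group": "SSN", "word": "078-05-1120", "score": 0.62},
]

auto_redact, needs_review = split_by_confidence(sample)
print(len(auto_redact), len(needs_review))
# 1 1
```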