# 🚀 Model Card for PII Detection with DeBERTa

This model is a fine-tuned version of DeBERTa for Named Entity Recognition (NER), specifically designed to detect Personally Identifiable Information (PII) in text.

## 🚀 Quick Start

This model is a fine-tuned version of microsoft/deberta for Named Entity Recognition (NER). It is specifically crafted to detect Personally Identifiable Information (PII) entities such as names, SSNs, phone numbers, credit card numbers, addresses, and more.
## ✨ Features

- Based on the powerful DeBERTa architecture.
- Specialized for PII detection through fine-tuning on a custom dataset.
- Performs sequence labeling using token-level classification.
## 📦 Installation

To use this model, install the `transformers` library (plus a backend such as PyTorch) using pip:

```bash
pip install transformers torch
```
## 💻 Usage Examples

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "AI-Enthusiast11/pii-entity-extractor"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

def merge_tokens(ner_results):
    """Group pipeline output by entity type, stitching subword pieces together."""
    entities = {}
    for entity in ner_results:
        entity_type = entity["entity_group"]
        entity_value = entity["word"].replace("##", "")
        if entity_type not in entities:
            entities[entity_type] = []
        if entities[entity_type] and not entity_value.startswith(" "):
            # Continuation of the previous piece: glue it onto the last value.
            entities[entity_type][-1] += entity_value
        else:
            entities[entity_type].append(entity_value)
    return entities

def redact_text_with_labels(text):
    """Replace every detected entity in `text` with its label, e.g. [NAME]."""
    ner_results = nlp(text)
    cleaned_entities = merge_tokens(ner_results)
    redacted_text = text
    for entity_type, values in cleaned_entities.items():
        for value in values:
            redacted_text = redacted_text.replace(value, f"[{entity_type}]")
    return redacted_text

# Reuse the objects loaded above instead of resolving the model name again.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

ner_results = nlp(example)
cleaned_entities = merge_tokens(ner_results)

print("\n==NER Results:==\n")
for entity_type, values in cleaned_entities.items():
    print(f"  {entity_type}: {', '.join(values)}")

redacted_example = redact_text_with_labels(example)
print(f"\n==Redacted Example:==\n{redacted_example}")
```
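To see how `merge_tokens` stitches subword fragments back together, here is a small self-contained check on hand-crafted, pipeline-style output (the dictionaries below are illustrative examples, not real model predictions):

```python
def merge_tokens(ner_results):
    # Same helper as above: group results by entity type and stitch subword pieces.
    entities = {}
    for entity in ner_results:
        entity_type = entity["entity_group"]
        entity_value = entity["word"].replace("##", "")
        if entity_type not in entities:
            entities[entity_type] = []
        if entities[entity_type] and not entity_value.startswith(" "):
            entities[entity_type][-1] += entity_value
        else:
            entities[entity_type].append(entity_value)
    return entities

# Hand-crafted sample: "##son" is a subword continuation of "John",
# while " Smith" (leading space) starts a new word.
sample = [
    {"entity_group": "NAME", "word": "John"},
    {"entity_group": "NAME", "word": "##son"},
    {"entity_group": "NAME", "word": " Smith"},
    {"entity_group": "PHONE-NO", "word": "727-814-3902"},
]

print(merge_tokens(sample))
# {'NAME': ['Johnson', ' Smith'], 'PHONE-NO': ['727-814-3902']}
```

Note that a value without a leading space is treated as a continuation of the previous piece, so entity boundaries rely on the tokenizer emitting leading spaces for new words.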
## 📚 Documentation

### Model Details

#### Model Description

This transformer-based model is fine-tuned on a custom dataset to detect sensitive information, commonly categorized as PII. The model performs sequence labeling to identify entities using token-level classification.

| Property | Details |
| --- | --- |
| Developed by | [Privatone] |
| Finetuned from model | microsoft/deberta |
| Model type | Token Classification (NER) |
| Language(s) | English |
| Use case | PII detection in text |
### Training Details

#### Training Data

The model was fine-tuned on a custom dataset containing labeled examples of the following PII entity types:

- NAME
- SSN
- PHONE-NO
- CREDIT-CARD-NO
- BANK-ACCOUNT-NO
- BANK-ROUTING-NO
- ADDRESS
#### Epoch Logs

| Epoch | Train Loss | Val Loss | Precision | Recall | F1 | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.3672 | 0.1987 | 0.7806 | 0.8114 | 0.7957 | 0.9534 |
| 2 | 0.1149 | 0.1011 | 0.9161 | 0.9772 | 0.9457 | 0.9797 |
| 3 | 0.0795 | 0.0889 | 0.9264 | 0.9825 | 0.9536 | 0.9813 |
| 4 | 0.0708 | 0.0880 | 0.9242 | 0.9842 | 0.9533 | 0.9806 |
| 5 | 0.0626 | 0.0858 | 0.9235 | 0.9851 | 0.9533 | 0.9806 |
#### SeqEval Classification Report

| Label | Precision | Recall | F1-score | Support |
| --- | --- | --- | --- | --- |
| ADDRESS | 0.91 | 0.94 | 0.92 | 77 |
| BANK-ACCOUNT-NO | 0.91 | 0.99 | 0.95 | 169 |
| BANK-ROUTING-NO | 0.85 | 0.96 | 0.90 | 104 |
| CREDIT-CARD-NO | 0.95 | 1.00 | 0.97 | 228 |
| NAME | 0.98 | 0.97 | 0.97 | 164 |
| PHONE-NO | 0.94 | 0.99 | 0.96 | 308 |
| SSN | 0.87 | 1.00 | 0.93 | 90 |

Summary:

- Micro avg: 0.95
- Macro avg: 0.95
- Weighted avg: 0.95
### Evaluation

#### Testing Data

Evaluation was done on a held-out portion of the same labeled dataset.

#### Metrics

- Precision
- Recall
- F1 (via seqeval)
- Entity-wise breakdown
- Token-level accuracy
#### Results

- Per-label F1-scores range from 0.90 to 0.97, with a micro-average of 0.95, showing robust PII detection across entity types.
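The entity-level precision, recall, and F1 that seqeval reports can be sketched in plain Python by comparing gold and predicted entities as `(label, start, end)` tuples, where an entity counts only on an exact match. This is a minimal illustration with made-up spans, not the card's actual evaluation code:

```python
def entity_prf(gold, pred):
    """Entity-level precision/recall/F1 over sets of (label, start, end) tuples."""
    tp = len(gold & pred)  # exact label-and-span matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up spans: one exact match, one missed entity, one spurious prediction.
gold = {("NAME", 0, 2), ("SSN", 5, 6)}
pred = {("NAME", 0, 2), ("PHONE-NO", 8, 9)}

print(entity_prf(gold, pred))
# (0.5, 0.5, 0.5)
```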
### Recommendations

⚠️ Important Note

- Use human review in high-risk environments.
- Evaluate on your own domain-specific data before deployment.
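One lightweight way to wire in human review is to route low-confidence detections to a reviewer instead of redacting them automatically. This is a sketch assuming pipeline output with `score` fields; the 0.90 threshold is an arbitrary illustration, not a tuned value:

```python
def split_by_confidence(ner_results, threshold=0.90):
    """Split detections into auto-redactable and needs-human-review buckets."""
    auto_redact, needs_review = [], []
    for entity in ner_results:
        if entity["score"] >= threshold:
            auto_redact.append(entity)
        else:
            needs_review.append(entity)
    return auto_redact, needs_review

# Illustrative pipeline-style output (not real model predictions).
sample = [
    {"entity_group": "NAME", "word": "Mia Thompson", "score": 0.99},
    {"entity_group": "SSN", "word": "078-05-1120", "score": 0.62},
]

auto_redact, needs_review = split_by_confidence(sample)
print(len(auto_redact), len(needs_review))
# 1 1
```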