pii-entity-extractor open-source model: free to deploy, accurately detects personally identifiable information in text

Pii Entity Extractor

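Developed by AI-Enthusiast11

A named entity recognition model fine-tuned from DeBERTa, built to detect personally identifiable information (PII) in text, such as names, Social Security numbers, and phone numbers.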

Sequence labeling

Transformers

Other: #SensitiveInformationDetection #PrivacyProtection #FinancialGradeNER

Downloads: 155

Released: 4/25/2025

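Model Overview

The model performs sequence labeling through token-level classification. It identifies PII entities of several types in text and is suited to privacy protection and data redaction.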
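To make the token-level view concrete, the sketch below groups per-token labels into entity spans. It assumes a BIO tagging scheme (`B-`/`I-` prefixes), which the source does not confirm. The tokens and labels are made up for illustration, and `bio_to_spans` is a hypothetical helper, not part of the model's code.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the entity being built, if any, and reset the state
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag ends the current entity;
            # the stray token itself is not kept
            flush()
    flush()
    return spans


# Illustrative input only: one label per whitespace token
tokens = ["Call", "Mia", "Thompson", "at", "727-814-3902"]
labels = ["O", "B-NAME", "I-NAME", "O", "B-PHONE-NO"]
spans = bio_to_spans(tokens, labels)
print(spans)
```

In practice the Hugging Face pipeline performs this grouping itself when an aggregation strategy is set, so a helper like this is mainly useful for understanding or debugging raw token predictions.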

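Model Features

High-precision PII detection

Achieves an F1 score above 0.95 on test data and accurately identifies multiple PII types

Multi-class entity recognition

Detects 7 PII types: names, Social Security numbers, phone numbers, credit card numbers, bank account numbers, bank routing numbers, and addresses

Subword merging

Post-processing logic automatically merges subword tokens that were split apart during tokenization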
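One way to merge split pieces is by character offsets instead of string prefixes. The sketch below assumes each detection carries `entity_group`, `start`, and `end` keys, as the Hugging Face token-classification pipeline returns when an aggregation strategy is set. `merge_adjacent` and its `max_gap` parameter are hypothetical, and the sample pieces are hand-written rather than real model output.

```python
from typing import Dict, List


def merge_adjacent(results: List[Dict], max_gap: int = 0) -> List[Dict]:
    """Merge same-type pieces whose character spans are at most max_gap apart."""
    merged: List[Dict] = []
    for piece in sorted(results, key=lambda r: r["start"]):
        last = merged[-1] if merged else None
        if (
            last is not None
            and last["entity_group"] == piece["entity_group"]
            and piece["start"] - last["end"] <= max_gap
        ):
            # Extend the previous span instead of starting a new one
            last["end"] = max(last["end"], piece["end"])
        else:
            merged.append({
                "entity_group": piece["entity_group"],
                "start": piece["start"],
                "end": piece["end"],
            })
    return merged


text = "Reach me at 727-814-3902 today."
# Hand-written pieces imitating a phone number split into two sub-spans
pieces = [
    {"entity_group": "PHONE-NO", "start": 12, "end": 19},
    {"entity_group": "PHONE-NO", "start": 19, "end": 24},
]
merged = merge_adjacent(pieces)
entity_texts = [text[m["start"]:m["end"]] for m in merged]
print(entity_texts)
```

Reading the entity text back from the original string by offset avoids depending on how a particular tokenizer marks word boundaries.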

Model Capabilities

Detection of sensitive information in text

Named entity recognition

Data redaction

Privacy protection

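Use Cases

Privacy protection

Document redaction

Automatically identify and replace sensitive information in documents

Enables an automated data redaction workflow

Compliance review

Detect content in text that may violate privacy regulations

Helps organizations meet compliance requirements such as GDPR

Data security

Log cleaning

Remove sensitive information before logs are stored

Reduces the risk of data leaks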
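For the log-cleaning case, one possible integration point is a `logging.Filter` that rewrites each message before a handler stores it. The sketch below is only an illustration: `RedactingFilter` is a hypothetical class, and `regex_phone_redactor` is a regex stand-in that keeps the example runnable offline. In a real setup the redactor would call the model-based redaction function instead.

```python
import io
import logging
import re
from typing import Callable

PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def regex_phone_redactor(text: str) -> str:
    """Stand-in redactor: masks NNN-NNN-NNNN phone numbers only."""
    return PHONE_RE.sub("[PHONE-NO]", text)


class RedactingFilter(logging.Filter):
    """Rewrite each record's message before the handler writes it."""

    def __init__(self, redactor: Callable[[str], str]):
        super().__init__()
        self.redactor = redactor

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so PII passed via args is also covered
        record.msg = self.redactor(record.getMessage())
        record.args = ()  # the message is already fully formatted
        return True  # keep the record, now redacted


# Demo: log into an in-memory stream so the result can be inspected
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactingFilter(regex_phone_redactor))
logger = logging.getLogger("pii_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("Callback requested at %s", "727-814-3902")
logged_line = stream.getvalue().strip()
print(logged_line)
```

Attaching the filter to the handler rather than the logger means every record that reaches that storage destination is redacted, regardless of which logger emitted it.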

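🚀 A DeBERTa Model for PII Detection

This model is fine-tuned from microsoft/deberta for named entity recognition (NER). It detects a range of personally identifiable information (PII) entities, such as names, Social Security numbers, phone numbers, credit card numbers, and addresses, which makes it useful for protecting personal privacy and data security.

🚀 Quick Start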

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

model_name = "AI-Enthusiast11/pii-entity-extractor"  
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Post processing logic to combine the subword tokens
def merge_tokens(ner_results):
    entities = {}
    for entity in ner_results:
        entity_type = entity["entity_group"]
        entity_value = entity["word"].replace("##", "")  # Remove subword prefixes

        # Handle token merging
        if entity_type not in entities:
            entities[entity_type] = []
        if entities[entity_type] and not entity_value.startswith(" "):
            # If the previous token exists and this one isn't a new word, merge it
            entities[entity_type][-1] += entity_value
        else:
            entities[entity_type].append(entity_value)

    return entities

def redact_text_with_labels(text):
    ner_results = nlp(text)

    # Merge tokens for multi-token entities (if any)
    cleaned_entities = merge_tokens(ner_results)

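    redacted_text = text
    for entity_type, values in cleaned_entities.items():
        for value in values:
            value = value.strip()  # keep surrounding spaces in the text intact
            if not value:
                continue  # replacing "" would insert the label between every character
            # Replace each identified entity with the label
            redacted_text = redacted_text.replace(value, f"[{entity_type}]")

    return redacted_text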

# Load the pipeline, reusing the model and tokenizer loaded above
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example input (choose one from your examples)
example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

# Run pipeline and process result
ner_results = nlp(example)
cleaned_entities = merge_tokens(ner_results)

# Print the NER results
print("\n==NER Results:==\n")
for entity_type, values in cleaned_entities.items():
    print(f"  {entity_type}: {', '.join(values)}")

# Redact the single example with labels
redacted_example = redact_text_with_labels(example)

# Print the redacted result
print(f"\n==Redacted Example:==\n{redacted_example}")

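✨ Key Features

Accurate detection: identifies multiple types of PII entities, including names, Social Security numbers, and phone numbers.
Fine-tuned: built on microsoft/deberta and fine-tuned for PII detection, reaching an F1 of about 0.95 in the training log.
Easy to use: comes with a code example for getting started quickly.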

💻 Usage Examples

Basic usage

# Basic example of running PII detection with the model
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

model_name = "AI-Enthusiast11/pii-entity-extractor"  
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Load the pipeline, reusing the model and tokenizer loaded above
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example input
example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

# Run the pipeline on the example
ner_results = nlp(example)

Advanced usage

# The code below post-processes the detections: it merges subword tokens and redacts the text
# Post-processing logic to combine the subword tokens
def merge_tokens(ner_results):
    entities = {}
    for entity in ner_results:
        entity_type = entity["entity_group"]
        entity_value = entity["word"].replace("##", "")  # Remove subword prefixes

        # Handle token merging
        if entity_type not in entities:
            entities[entity_type] = []
        if entities[entity_type] and not entity_value.startswith(" "):
            # If the previous token exists and this one isn't a new word, merge it
            entities[entity_type][-1] += entity_value
        else:
            entities[entity_type].append(entity_value)

    return entities

def redact_text_with_labels(text):
    ner_results = nlp(text)

    # Merge tokens for multi-token entities (if any)
    cleaned_entities = merge_tokens(ner_results)

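    redacted_text = text
    for entity_type, values in cleaned_entities.items():
        for value in values:
            value = value.strip()  # keep surrounding spaces in the text intact
            if not value:
                continue  # replacing "" would insert the label between every character
            # Replace each identified entity with the label
            redacted_text = redacted_text.replace(value, f"[{entity_type}]")

    return redacted_text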

# Example input
example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

# Run the pipeline and process the result
ner_results = nlp(example)
cleaned_entities = merge_tokens(ner_results)

# Print the NER results
print("\n==NER Results:==\n")
for entity_type, values in cleaned_entities.items():
    print(f"  {entity_type}: {', '.join(values)}")

# Redact the example text
redacted_example = redact_text_with_labels(example)

# Print the redacted result
print(f"\n==Redacted Example:==\n{redacted_example}")
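Redacting with `str.replace` substitutes every occurrence of a detected value, including occurrences the model did not flag. An alternative is to use the `start` and `end` character offsets that the pipeline attaches to each detection. The sketch below is a hypothetical variant, not the author's method, and it runs on hand-written detections so that it works without downloading the model.

```python
from typing import Dict, List


def redact_by_offsets(text: str, results: List[Dict]) -> str:
    """Replace each detected span with [LABEL], working right to left."""
    redacted = text
    # Going from the end keeps earlier offsets valid after each replacement
    for r in sorted(results, key=lambda r: r["start"], reverse=True):
        label = f"[{r['entity_group']}]"
        redacted = redacted[:r["start"]] + label + redacted[r["end"]:]
    return redacted


text = "I am Mia Thompson, phone 727-814-3902."
# Hand-written detections in the pipeline's output shape
sample_results = [
    {"entity_group": "NAME", "start": 5, "end": 17},
    {"entity_group": "PHONE-NO", "start": 25, "end": 37},
]
redacted = redact_by_offsets(text, sample_results)
print(redacted)
```

This version assumes the spans do not overlap; overlapping detections would need to be merged first.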

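📚 Documentation

Model Details

Model description

This is a Transformer-based model fine-tuned on a custom dataset to detect sensitive information, commonly categorized as personally identifiable information (PII). It performs sequence labeling, identifying entities through token-level classification.

Property	Details
Model type	Token classification (NER)
Developed by	[Privatone]
Fine-tuned from	`microsoft/deberta`
Language	English
Use case	PII detection in text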

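Training Details

Training data

The model was fine-tuned on a custom annotated dataset containing the following PII entity types:

Name (NAME)
Social Security number (SSN)
Phone number (PHONE-NO)
Credit card number (CREDIT-CARD-NO)
Bank account number (BANK-ACCOUNT-NO)
Bank routing number (BANK-ROUTING-NO)
Address (ADDRESS)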
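For token classification, entity types like these are typically expanded into a per-token label set. The sketch below builds one under an assumed BIO scheme; the source does not state which scheme or label order the model uses. The actual mapping can be read from `model.config.id2label` after loading the model.

```python
ENTITY_TYPES = [
    "NAME", "SSN", "PHONE-NO", "CREDIT-CARD-NO",
    "BANK-ACCOUNT-NO", "BANK-ROUTING-NO", "ADDRESS",
]

# Assumed BIO scheme: one "O" label plus B-/I- variants for each entity type.
# The order here is illustrative and may differ from the model's own config.
labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(len(labels))
print(id2label)
```

Under this assumption, seven entity types yield 15 labels in total.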

Training epoch log

Epoch	Training loss	Validation loss	Precision	Recall	F1	Accuracy
1	0.3672	0.1987	0.7806	0.8114	0.7957	0.9534
2	0.1149	0.1011	0.9161	0.9772	0.9457	0.9797
3	0.0795	0.0889	0.9264	0.9825	0.9536	0.9813
4	0.0708	0.0880	0.9242	0.9842	0.9533	0.9806
5	0.0626	0.0858	0.9235	0.9851	0.9533	0.9806
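As a sanity check, the F1 column can be recomputed from the precision and recall columns, since F1 is their harmonic mean. The sketch below uses only the values from the log above; small differences are expected because the logged numbers are rounded to four decimals.

```python
# (epoch, precision, recall, reported_f1) copied from the training log
EPOCHS = [
    (1, 0.7806, 0.8114, 0.7957),
    (2, 0.9161, 0.9772, 0.9457),
    (3, 0.9264, 0.9825, 0.9536),
    (4, 0.9242, 0.9842, 0.9533),
    (5, 0.9235, 0.9851, 0.9533),
]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for epoch, p, r, reported in EPOCHS:
    print(f"epoch {epoch}: recomputed F1 = {f1_score(p, r):.4f}, reported = {reported:.4f}")

# Largest gap between recomputed and reported F1 across epochs
max_gap = max(abs(f1_score(p, r) - reported) for _, p, r, reported in EPOCHS)
```

The log also shows F1 flattening after epoch 3 while recall keeps rising slightly and precision dips, which suggests the later epochs add little.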

SeqEval classification report

Label	Precision	Recall	F1 score	Support
Address (ADDRESS)	0.91	0.94	0.92	77
Bank account number (BANK-ACCOUNT-NO)	0.91	0.99	0.95	169
Bank routing number (BANK-ROUTING-NO)	0.85	0.96	0.90	104
Credit card number (CREDIT-CARD-NO)	0.95	1.00	0.97	228
Name (NAME)	0.98	0.97	0.97	164
Phone number (PHONE-NO)	0.94	0.99	0.96	308
Social Security number (SSN)	0.87	1.00	0.93	90
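The report lists per-label scores only. Summary figures can be derived from them, as in the sketch below, which computes a macro average (each label weighted equally) and a support-weighted average of F1. These are recomputed from the rounded table values, so they are approximations and not figures reported by the source.

```python
# label -> (f1, support), copied from the classification report
REPORT = {
    "ADDRESS": (0.92, 77),
    "BANK-ACCOUNT-NO": (0.95, 169),
    "BANK-ROUTING-NO": (0.90, 104),
    "CREDIT-CARD-NO": (0.97, 228),
    "NAME": (0.97, 164),
    "PHONE-NO": (0.96, 308),
    "SSN": (0.93, 90),
}

total_support = sum(support for _, support in REPORT.values())

# Macro average: plain mean over labels
macro_f1 = sum(f1 for f1, _ in REPORT.values()) / len(REPORT)

# Weighted average: each label's F1 weighted by its number of examples
weighted_f1 = sum(f1 * support for f1, support in REPORT.values()) / total_support

print(f"total support: {total_support}")
print(f"macro F1:      {macro_f1:.4f}")
print(f"weighted F1:   {weighted_f1:.4f}")
```

The weighted figure comes out slightly above the macro figure because the best-represented labels (phone and credit card numbers) also score well, while the weakest label, bank routing numbers, has fewer examples.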