pii-entity-extractor open-source model: free to deploy, detects personally identifiable information in text

Pii Entity Extractor

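Developed by AI-Enthusiast11

A named entity recognition model fine-tuned from DeBERTa, built specifically to detect personally identifiable information (PII) in text, such as names, Social Security numbers, and phone numbers.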

Sequence labeling

Transformers

Other #Sensitive information detection #Privacy protection #Financial-grade NER

Downloads: 155

Release date: 4/25/2025

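Model overview

The model performs sequence labeling through token-level classification to identify PII entities in text. It is suited to privacy protection and data redaction workflows.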

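Model features

High-accuracy PII detection

Reaches an F1 score above 0.95 on the evaluation data (0.9533 at the final epoch) and recognizes multiple PII types

Multi-category entity recognition

Detects seven PII categories: names, Social Security numbers, phone numbers, credit card numbers, bank account numbers, bank routing numbers, and addresses

Subword merging

The example code includes post-processing logic that merges subword tokens that were split apart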

Model capabilities

Sensitive information detection in text

Named entity recognition

Data redaction

Privacy protection

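Use cases

Privacy protection

Document redaction

Automatically identify and replace sensitive information in documents

Enables an automated data redaction workflow

Compliance review

Detect content in text that may violate privacy regulations

Helps organizations meet compliance requirements such as GDPR

Data security

Log scrubbing

Remove sensitive information before logs are stored

Reduces the risk of data leaks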

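🚀 A model for PII detection using DeBERTa

This model is fine-tuned for named entity recognition (NER) and is based on microsoft/deberta. It detects personally identifiable information (PII) entities such as names, Social Security numbers, phone numbers, credit card numbers, and addresses, which makes it useful for protecting personal privacy and data security.

🚀 Quick start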

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

model_name = "AI-Enthusiast11/pii-entity-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Load the pipeline, reusing the model and tokenizer loaded above.
# aggregation_strategy="simple" groups adjacent tokens that share a label.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Post-processing logic to combine subword tokens
def merge_tokens(ner_results):
    """Collect detected values per entity type, joining split fragments."""
    entities = {}
    for entity in ner_results:
        entity_type = entity["entity_group"]
        entity_value = entity["word"].replace("##", "")  # Remove "##" subword prefixes, if any

        # Handle token merging
        if entity_type not in entities:
            entities[entity_type] = []
        if entities[entity_type] and not entity_value.startswith(" "):
            # A previous value of this type exists and this fragment does not
            # start a new word, so append it to that value. Separate entities of
            # the same type are joined too unless the later one starts with a space.
            entities[entity_type][-1] += entity_value
        else:
            entities[entity_type].append(entity_value)

    return entities

def redact_text_with_labels(text):
    """Return text with each detected entity replaced by its [LABEL]."""
    ner_results = nlp(text)

    # Merge tokens for multi-token entities (if any)
    cleaned_entities = merge_tokens(ner_results)

    redacted_text = text
    for entity_type, values in cleaned_entities.items():
        for value in values:
            # Replace every occurrence of the detected value with its label
            redacted_text = redacted_text.replace(value, f"[{entity_type}]")

    return redacted_text

# Example input (choose one from your examples)
example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

# Run pipeline and process result
ner_results = nlp(example)
cleaned_entities = merge_tokens(ner_results)

# Print the NER results
print("\n==NER Results:==\n")
for entity_type, values in cleaned_entities.items():
    print(f"  {entity_type}: {', '.join(values)}")

# Redact the single example with labels
redacted_example = redact_text_with_labels(example)

# Print the redacted result
print(f"\n==Redacted Example:==\n{redacted_example}")
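The `str.replace` step above replaces every occurrence of a detected string, and a merged value may no longer match the source text exactly. A minimal sketch of an alternative follows, assuming each pipeline result carries the `start` and `end` character offsets that the Transformers token-classification pipeline reports when an aggregation strategy is set. Labels are spliced in from right to left so that earlier offsets stay valid. The function name `redact_by_offsets` and the sample spans are hypothetical; the spans are hand-written, not real model output.

```python
def redact_by_offsets(text, ner_results):
    """Replace each detected span with its [LABEL], using character offsets."""
    redacted = text
    # Work from the end of the text so earlier offsets are not shifted.
    for ent in sorted(ner_results, key=lambda e: e["start"], reverse=True):
        label = f"[{ent['entity_group']}]"
        redacted = redacted[:ent["start"]] + label + redacted[ent["end"]:]
    return redacted


# Hypothetical spans in the shape the pipeline returns (not real model output).
sample_text = "Call Mia Thompson at 727-814-3902."
sample_results = [
    {"entity_group": "NAME", "start": 5, "end": 17},
    {"entity_group": "PHONE-NO", "start": 21, "end": 33},
]

print(redact_by_offsets(sample_text, sample_results))
```

With a real run, the second argument would be the list returned by `nlp(text)`. If the tokenizer does not provide offsets, `start` and `end` can be `None`, and this approach does not apply.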

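✨ Key features

Accurate detection: identifies multiple types of PII entities, including names, Social Security numbers, and phone numbers.
Fine-tuned: fine-tuned from the microsoft/deberta model, reaching an F1 score of about 0.95 on the PII detection task.
Easy to use: code examples are provided so users can get started quickly.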

💻 Usage examples

Basic usage

# Basic example of using the model for PII detection
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

model_name = "AI-Enthusiast11/pii-entity-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Load the pipeline, reusing the model and tokenizer loaded above
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example input
example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

# Run the pipeline; the result is a list of detected entity spans
ner_results = nlp(example)
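Each item in the pipeline result is a dictionary. With an aggregation strategy set, the Transformers pipeline reports `entity_group`, `score`, `word`, `start`, and `end` for each span. The sketch below shows one way to drop low-confidence detections before acting on them. The helper `filter_by_score`, the 0.80 threshold, and the sample values are illustrative assumptions, not recommendations from the model author.

```python
def filter_by_score(ner_results, min_score=0.80):
    """Keep only spans whose confidence score is at least min_score."""
    return [ent for ent in ner_results if ent["score"] >= min_score]


# Hypothetical results for the text "I'm Mia Thompson, account 4893172051."
# (hand-written values, not real model output).
sample_results = [
    {"entity_group": "NAME", "score": 0.98, "word": "Mia Thompson", "start": 4, "end": 16},
    {"entity_group": "BANK-ACCOUNT-NO", "score": 0.55, "word": "4893172051", "start": 26, "end": 36},
]

kept = filter_by_score(sample_results)
for ent in kept:
    print(f"{ent['entity_group']:<16} {ent['score']:.2f}  {ent['word']}")
```

For redaction, a lower threshold trades more false positives for fewer missed entities, so the right value depends on which error is more costly in the target workflow.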

Advanced usage

# This code post-processes the detection results: it merges subword tokens
# and redacts the text. It reuses the nlp pipeline created in the basic example.
def merge_tokens(ner_results):
    """Collect detected values per entity type, joining split fragments."""
    entities = {}
    for entity in ner_results:
        entity_type = entity["entity_group"]
        entity_value = entity["word"].replace("##", "")  # Remove "##" subword prefixes, if any

        # Handle token merging
        if entity_type not in entities:
            entities[entity_type] = []
        if entities[entity_type] and not entity_value.startswith(" "):
            # A previous value of this type exists and this fragment does not
            # start a new word, so append it to that value. Separate entities of
            # the same type are joined too unless the later one starts with a space.
            entities[entity_type][-1] += entity_value
        else:
            entities[entity_type].append(entity_value)

    return entities

def redact_text_with_labels(text):
    """Return text with each detected entity replaced by its [LABEL]."""
    ner_results = nlp(text)

    # Merge tokens for multi-token entities (if any)
    cleaned_entities = merge_tokens(ner_results)

    redacted_text = text
    for entity_type, values in cleaned_entities.items():
        for value in values:
            # Replace every occurrence of the detected value with its label
            redacted_text = redacted_text.replace(value, f"[{entity_type}]")

    return redacted_text

# Example input
example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

# Run the pipeline and process the result
ner_results = nlp(example)
cleaned_entities = merge_tokens(ner_results)

# Print the NER results
print("\n==NER Results:==\n")
for entity_type, values in cleaned_entities.items():
    print(f"  {entity_type}: {', '.join(values)}")

# Redact the example text
redacted_example = redact_text_with_labels(example)

# Print the redacted result
print(f"\n==Redacted Example:==\n{redacted_example}")
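`merge_tokens` joins fragments by label alone, so two separate entities of the same type can end up concatenated. The sketch below shows an alternative that merges fragments only when they share a label and touch in the source text, then reads the merged value back from the original string. It assumes the results carry `start` and `end` offsets. `merge_adjacent_spans` and `max_gap` are hypothetical names, and the sample fragments are hand-written rather than real model output.

```python
def merge_adjacent_spans(text, ner_results, max_gap=0):
    """Merge same-label spans that are at most max_gap characters apart."""
    merged = []
    for ent in sorted(ner_results, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev["entity_group"] == ent["entity_group"]
                and ent["start"] - prev["end"] <= max_gap):
            # Same label and touching the previous span: extend it.
            prev["end"] = max(prev["end"], ent["end"])
        else:
            merged.append({"entity_group": ent["entity_group"],
                           "start": ent["start"], "end": ent["end"]})
    # Recover the surface text from the source string, not from token pieces.
    for span in merged:
        span["word"] = text[span["start"]:span["end"]]
    return merged


sample_text = "Reach me at 727-814-3902 or 415-555-0134."
sample_results = [
    {"entity_group": "PHONE-NO", "start": 12, "end": 19},  # "727-814"
    {"entity_group": "PHONE-NO", "start": 19, "end": 24},  # "-3902"
    {"entity_group": "PHONE-NO", "start": 28, "end": 40},  # "415-555-0134"
]

for span in merge_adjacent_spans(sample_text, sample_results):
    print(span["entity_group"], span["word"])
```

Setting `max_gap=1` would also join fragments separated by a single space, which may suit multi-word addresses; whether that is desirable depends on how the model splits entities in practice.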

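📚 Detailed documentation

Model details

Model description

This is a Transformer-based model fine-tuned on a custom dataset to detect sensitive information, commonly categorized as personally identifiable information (PII). The model uses sequence labeling: it classifies each token and identifies entities from those token-level labels.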

Property	Details
Model type	Token classification (NER)
Developed by	[Privatone]
Fine-tuned from	`microsoft/deberta`
Language	English
Use case	PII detection in text
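To illustrate how token-level labels turn into entities, the sketch below groups per-token tags into spans. The model card does not state the tagging scheme, so the BIO convention used here (`B-` opens an entity, `I-` continues it, `O` is outside) is an assumption, as are the function name `bio_to_spans` and the sample tokens and tags.

```python
def bio_to_spans(tokens, tags):
    """Group (token, tag) pairs into (label, text) entity spans."""
    spans = []
    current_label, current_tokens = None, []

    def flush():
        # Close the entity being built, if there is one.
        if current_tokens:
            spans.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O": outside any entity
            flush()
            current_label, current_tokens = None, []
    flush()
    return spans


# Hypothetical tokens and tags, assuming a BIO scheme over the card's labels.
tokens = ["Hi", "I'm", "Mia", "Thompson", "call", "727-814-3902"]
tags = ["O", "O", "B-NAME", "I-NAME", "O", "B-PHONE-NO"]
print(bio_to_spans(tokens, tags))
```

In the examples in this document, the pipeline's `aggregation_strategy` does this grouping, so a helper like this is only needed when working with raw per-token predictions.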

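Training details

Training data

The model was fine-tuned on a custom annotated dataset containing the following PII entity types:

Name (NAME)
Social Security number (SSN)
Phone number (PHONE-NO)
Credit card number (CREDIT-CARD-NO)
Bank account number (BANK-ACCOUNT-NO)
Bank routing number (BANK-ROUTING-NO)
Address (ADDRESS)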

Training epoch log

Epoch	Training loss	Validation loss	Precision	Recall	F1	Accuracy
1	0.3672	0.1987	0.7806	0.8114	0.7957	0.9534
2	0.1149	0.1011	0.9161	0.9772	0.9457	0.9797
3	0.0795	0.0889	0.9264	0.9825	0.9536	0.9813
4	0.0708	0.0880	0.9242	0.9842	0.9533	0.9806
5	0.0626	0.0858	0.9235	0.9851	0.9533	0.9806
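The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick consistency check, the sketch below recomputes it from the rounded precision and recall values in the table above. The variable and function names are illustrative, and small differences in the last digit are expected from rounding.

```python
# (epoch, precision, recall, reported F1) copied from the training log above.
epochs = [
    (1, 0.7806, 0.8114, 0.7957),
    (2, 0.9161, 0.9772, 0.9457),
    (3, 0.9264, 0.9825, 0.9536),
    (4, 0.9242, 0.9842, 0.9533),
    (5, 0.9235, 0.9851, 0.9533),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for epoch, p, r, reported in epochs:
    print(f"epoch {epoch}: recomputed F1 = {f1_score(p, r):.4f}, reported = {reported:.4f}")
```

The log also shows that F1 stops improving after epoch 3 while training loss keeps falling and precision drifts slightly down as recall rises.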

SeqEval classification report

Label	Precision	Recall	F1 score	Support
Address (ADDRESS)	0.91	0.94	0.92	77
Bank account number (BANK-ACCOUNT-NO)	0.91	0.99	0.95	169
Bank routing number (BANK-ROUTING-NO)	0.85	0.96	0.90	104
Credit card number (CREDIT-CARD-NO)	0.95	1.00	0.97	228
Name (NAME)	0.98	0.97	0.97	164
Phone number (PHONE-NO)	0.94	0.99	0.96	308
Social Security number (SSN)	0.87	1.00	0.93	90
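The report lists per-label scores only. Two common summaries are the macro average (unweighted mean over labels) and the weighted average (each label weighted by its support). The sketch below computes both from the rounded F1 values in the table. The results are approximations derived from the table, not figures reported by the model author, and the names used are illustrative.

```python
# label: (F1 score, support), copied from the classification report above.
report = {
    "ADDRESS": (0.92, 77),
    "BANK-ACCOUNT-NO": (0.95, 169),
    "BANK-ROUTING-NO": (0.90, 104),
    "CREDIT-CARD-NO": (0.97, 228),
    "NAME": (0.97, 164),
    "PHONE-NO": (0.96, 308),
    "SSN": (0.93, 90),
}

total_support = sum(support for _, support in report.values())

# Macro average: every label counts equally.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: labels with more examples count more.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(f"total support: {total_support}")
print(f"macro F1:      {macro_f1:.4f}")
print(f"weighted F1:   {weighted_f1:.4f}")
```

The gap between the two averages comes mainly from the lower-scoring labels with smaller support, such as BANK-ROUTING-NO and ADDRESS, which pull the macro average down more than the weighted one.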