pii-entity-extractor open-source model: free to deploy, detects personally identifiable information in text with high accuracy

Pii Entity Extractor

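Developed by AI-Enthusiast11

A named entity recognition model fine-tuned from DeBERTa, designed specifically to detect personally identifiable information (PII) in text, such as names, Social Security numbers, and phone numbers.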

Sequence labeling

Transformers

Other: #sensitive-information-detection #privacy-protection #financial-NER

Downloads: 155

Release date: 4/25/2025

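Model Overview

This model performs sequence labeling through token-level classification and identifies a range of PII entities in text. It is suited to privacy protection and data masking scenarios.

Model Features

High-accuracy PII detection

Achieves an F1 score of 0.95 or higher on test data and accurately identifies multiple PII types

Multi-category entity recognition

Supports detection of seven PII types: names, Social Security numbers, phone numbers, credit card numbers, bank account numbers, bank routing numbers, and addresses

Subword merging

Included post-processing logic automatically merges split subword tokens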

Model Capabilities

Sensitive information detection in text

Named entity recognition

Data masking

Privacy protection

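Use Cases

Privacy protection

Document masking

Automatically identify and replace sensitive information in documents

Enables automated data masking workflows

Compliance review

Detect text content that may violate privacy regulations

Helps meet compliance requirements such as GDPR

Data security

Log cleaning

Remove sensitive information before logs are stored

Reduces the risk of data leaks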
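
The log cleaning use case can be sketched with Python's standard `logging` module. This is a minimal illustration, not part of the model card: `RedactingFilter` and `stub_redactor` are hypothetical names, and the stub uses a simple regular expression in place of the model so that the example runs without downloading anything. In practice the redactor would be a function that calls the PII model.

```python
import io
import logging
import re
from typing import Callable


class RedactingFilter(logging.Filter):
    """Logging filter that passes each record's message through a redactor."""

    def __init__(self, redactor: Callable[[str], str]):
        super().__init__()
        self.redactor = redactor

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so %-style arguments are scrubbed too.
        record.msg = self.redactor(record.getMessage())
        record.args = ()
        return True  # keep the record, now redacted


def stub_redactor(text: str) -> str:
    # Hypothetical stand-in for the model: masks US-style phone numbers only.
    return re.sub(r"\b\d{3}-\d{3}-\d{4}\b", "[PHONE-NO]", text)


# Write log output to an in-memory stream so the result can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactingFilter(stub_redactor))

logger = logging.getLogger("pii_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info("Callback requested at %s", "727-814-3902")
print(stream.getvalue().strip())  # Callback requested at [PHONE-NO]
```

Attaching the filter to the handler means the text is scrubbed before it reaches the storage destination, which matches the "remove before storing" intent above.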

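🚀 Model Card: PII Detection with DeBERTa

This model is a version of microsoft/deberta fine-tuned for named entity recognition (NER). It is designed specifically to detect personally identifiable information (PII) entities such as names, Social Security numbers (SSNs), phone numbers, credit card numbers, and addresses.

🚀 Quick Start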

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

model_name = "AI-Enthusiast11/pii-entity-extractor"  
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Post processing logic to combine the subword tokens
def merge_tokens(ner_results):
    entities = {}
    for entity in ner_results:
        entity_type = entity["entity_group"]
        entity_value = entity["word"].replace("##", "")  # Remove subword prefixes

        # Handle token merging
        if entity_type not in entities:
            entities[entity_type] = []
        if entities[entity_type] and not entity_value.startswith(" "):
            # If the previous token exists and this one isn't a new word, merge it
            entities[entity_type][-1] += entity_value
        else:
            entities[entity_type].append(entity_value)

    return entities

def redact_text_with_labels(text):
    ner_results = nlp(text)

    # Merge tokens for multi-token entities (if any)
    cleaned_entities = merge_tokens(ner_results)

    redacted_text = text
    for entity_type, values in cleaned_entities.items():
        for value in values:
            # Replace each identified entity with the label
            redacted_text = redacted_text.replace(value, f"[{entity_type}]")

    return redacted_text

# Load the pipeline, reusing the model and tokenizer loaded above
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example input (choose one from your examples)
example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

# Run pipeline and process result
ner_results = nlp(example)
cleaned_entities = merge_tokens(ner_results)

# Print the NER results
print("\n==NER Results:==\n")
for entity_type, values in cleaned_entities.items():
    print(f"  {entity_type}: {', '.join(values)}")

# Redact the single example with labels
redacted_example = redact_text_with_labels(example)

# Print the redacted result
print(f"\n==Redacted Example:==\n{redacted_example}")
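
The redaction above relies on string replacement, which masks every occurrence of a matched value wherever it appears in the text. As an alternative sketch, the character offsets can be used instead. This assumes each aggregated pipeline result carries `start` and `end` keys alongside `entity_group`, which is typically the case with a fast tokenizer. `redact_by_offsets` is a hypothetical helper, and the entities below are written by hand rather than produced by the model.

```python
from typing import Dict, List


def redact_by_offsets(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with its label, using character offsets."""
    # Work from the end of the text so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + label + text[ent["end"]:]
    return text


# Hand-written entities in the assumed pipeline shape; not real model output.
sample = "Call Mia at 727-814-3902."
fake_entities = [
    {"entity_group": "NAME", "start": 5, "end": 8},
    {"entity_group": "PHONE-NO", "start": 12, "end": 24},
]

print(redact_by_offsets(sample, fake_entities))  # Call [NAME] at [PHONE-NO].
```

With the real pipeline, the call would look like `redact_by_offsets(example, nlp(example))`, provided the offsets are present in the results.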

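✨ Key Features

This transformer-based model has been fine-tuned on a custom dataset to detect sensitive information commonly classified as PII.
It performs sequence labeling, identifying entities through token-level classification.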
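
To illustrate what token-level classification means in practice, the sketch below groups per-token tags into entities. It assumes a BIO tagging scheme (`B-` opens an entity, `I-` continues it, `O` is outside any entity). The model card does not state which scheme the model uses, so the tags and the `decode_bio` helper here are illustrative only.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) entities."""
    entities: List[Tuple[str, str]] = []
    label, words = None, []

    def flush():
        # Emit the entity collected so far, if any.
        if words:
            entities.append((label, " ".join(words)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            # A new entity starts (an I- tag with a new label is treated as a start).
            flush()
            label, words = tag[2:], [token]
        elif tag.startswith("I-"):
            words.append(token)  # continue the current entity
        else:
            flush()  # "O" tag closes any open entity
            label, words = None, []
    flush()
    return entities


# Hand-written tags for illustration; not actual model predictions.
tokens = ["Hi", "I'm", "Mia", "Thompson", "call", "727-814-3902"]
tags = ["O", "O", "B-NAME", "I-NAME", "O", "B-PHONE-NO"]
print(decode_bio(tokens, tags))
# [('NAME', 'Mia Thompson'), ('PHONE-NO', '727-814-3902')]
```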

📦 Installation

The model requires the transformers library, which can be installed with the following command.

pip install transformers

📚 Documentation

Model Details

Model Description

Attribute	Details
Model type	Token classification (NER)
Developer	[Privatone]
Fine-tuned from	`microsoft/deberta`
Language	English
Use case	PII detection in text

Training Details

Training Data

The model was fine-tuned on a custom dataset containing labeled examples of the following PII entity types.

NAME
SSN
PHONE-NO
CREDIT-CARD-NO
BANK-ACCOUNT-NO
BANK-ROUTING-NO
ADDRESS

Epoch Log

Epoch	Training loss	Validation loss	Precision	Recall	F1	Accuracy
1	0.3672	0.1987	0.7806	0.8114	0.7957	0.9534
2	0.1149	0.1011	0.9161	0.9772	0.9457	0.9797
3	0.0795	0.0889	0.9264	0.9825	0.9536	0.9813
4	0.0708	0.0880	0.9242	0.9842	0.9533	0.9806
5	0.0626	0.0858	0.9235	0.9851	0.9533	0.9806
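
The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it from the precision and recall values in the table. The inputs are the rounded figures shown above, so small differences in the last digit are possible.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (epoch, precision, recall, reported F1), copied from the epoch log.
epochs = [
    (1, 0.7806, 0.8114, 0.7957),
    (2, 0.9161, 0.9772, 0.9457),
    (3, 0.9264, 0.9825, 0.9536),
    (4, 0.9242, 0.9842, 0.9533),
    (5, 0.9235, 0.9851, 0.9533),
]

for epoch, p, r, reported in epochs:
    print(f"epoch {epoch}: computed F1 = {f1_score(p, r):.4f}, reported = {reported:.4f}")
```

Recall rises faster than precision across epochs, which suggests that the remaining errors are mostly false positives rather than missed entities.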

SeqEval Classification Report

Label	Precision	Recall	F1	Support
ADDRESS	0.91	0.94	0.92	77
BANK-ACCOUNT-NO	0.91	0.99	0.95	169
BANK-ROUTING-NO	0.85	0.96	0.90	104
CREDIT-CARD-NO	0.95	1.00	0.97	228
NAME	0.98	0.97	0.97	164
PHONE-NO	0.94	0.99	0.96	308
SSN	0.87	1.00	0.93	90
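
As a rough cross-check against the epoch log, the per-label F1 values can be combined into a support-weighted average. The sketch below uses the two-decimal figures from the report, so the result is approximate. It comes out near 0.95, in line with the overall F1 reported in the epoch log.

```python
# label: (F1, support), copied from the SeqEval classification report.
report = {
    "ADDRESS": (0.92, 77),
    "BANK-ACCOUNT-NO": (0.95, 169),
    "BANK-ROUTING-NO": (0.90, 104),
    "CREDIT-CARD-NO": (0.97, 228),
    "NAME": (0.97, 164),
    "PHONE-NO": (0.96, 308),
    "SSN": (0.93, 90),
}

total_support = sum(n for _, n in report.values())
# Each label's F1 is weighted by the number of entities of that type.
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total_support

print(f"total support: {total_support}")
print(f"support-weighted F1: {weighted_f1:.3f}")
```

BANK-ROUTING-NO and SSN have the lowest precision in the report (0.85 and 0.87), so a masking pipeline may over-redact other digit strings as those types.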