deberta-v3-large Self-Disclosure Detection Model: Identifies 17 Categories of Personal Information

Deberta V3 Large Self Disclosure Detection

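Developed by douy

A model that detects self-disclosures (personal information) in sentences, covering 17 categories of personal information.

Sequence labeling

Transformers

English #PersonalInformationDetection #PrivacyProtection #MultiClassTokenClassification

Downloads: 32

Released: 5/12/2024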

Model Overview

This model is a fine-tuned DeBERTa-v3-large designed to detect disclosures of personal information in text. It uses multi-class token classification, similar to named entity recognition.

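Model Features

Multi-category identification

Identifies 17 categories of personal information, such as age, gender, occupation, and location

Strong detection performance

Achieves a partial span F1 of 65.71, outperforming GPT-4 prompting

Research use only

Use of the model is limited to research purposes and is subject to the usage guidelines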

Model Capabilities

Token classification on text

Personal information identification

Privacy risk detection

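Use Cases

Privacy protection

Social media content analysis

Detect personal information that users disclose unintentionally on social media

Identify potential privacy risk points

Privacy compliance checks

Used by organizations to check user-generated content for sensitive information

Helps meet the requirements of data protection regulations

Academic research

Online behavior research

Analyze patterns of self-disclosure by users on the internet

Provide supporting data for research in psychology and sociology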

🚀 deberta-v3-large-self-disclosure-detection

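This model detects self-disclosures (personal information) in sentences. It is a multi-class token classification task, similar to NER, with labels in IOB2 format.

🚀 Quick Start

Each word in the input receives one IOB2 label: `B-` marks the first word of a disclosure span, `I-` marks its continuation, and `O` marks words outside any span. For example, "I am 22 years old and ..." has the labels ["B-Age", "I-Age", "I-Age", "I-Age", "I-Age", "O", ...].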
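
The IOB2 labels described above can be collapsed into word-level spans. The following is a minimal sketch, not part of the original model card: the helper name `iob2_to_spans` is hypothetical, and ignoring stray `I-` tags is a simplifying assumption rather than the authors' post-processing.

```python
from typing import List, Tuple


def iob2_to_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse IOB2 tags into (category, start, end) spans; end is exclusive."""
    spans = []
    current = None  # (category, start index) of the span being built
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            # A B- tag always closes any open span and opens a new one.
            if current is not None:
                spans.append((current[0], current[1], i))
            current = (label[2:], i)
        elif label.startswith("I-") and current is not None and current[0] == label[2:]:
            # The open span continues.
            continue
        else:
            # "O", or a stray I- tag that does not continue the open span
            # (assumption: stray I- tags are ignored).
            if current is not None:
                spans.append((current[0], current[1], i))
            current = None
    if current is not None:
        spans.append((current[0], current[1], len(labels)))
    return spans


words = "I am 22 years old and".split()
labels = ["B-Age", "I-Age", "I-Age", "I-Age", "I-Age", "O"]
spans = iob2_to_spans(labels)
for category, start, end in spans:
    print(category, "|", " ".join(words[start:end]))
```

Using exclusive end indices lets each span be sliced directly from the word list with `words[start:end]`.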

The model detects the following 17 categories: "Age", "Age_Gender", "Appearance", "Education", "Family", "Finance", "Gender", "Health", "Husband_BF", "Location", "Mental_Health", "Occupation", "Pet", "Race_Nationality", "Relationship_Status", "Sexual_Orientation", "Wife_GF".
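
With IOB2 tagging, each category contributes a `B-` and an `I-` label, plus one shared `O` label. The sketch below builds that label inventory for illustration only. The ordering is an assumption; when running the model, read the actual mapping from `config.id2label` rather than rebuilding it.

```python
# The 17 categories listed in the model card.
CATEGORIES = [
    "Age", "Age_Gender", "Appearance", "Education", "Family", "Finance",
    "Gender", "Health", "Husband_BF", "Location", "Mental_Health",
    "Occupation", "Pet", "Race_Nationality", "Relationship_Status",
    "Sexual_Orientation", "Wife_GF",
]

# Illustrative ordering only: "O" first, then B-/I- pairs per category.
iob2_labels = ["O"] + [f"{prefix}-{cat}" for cat in CATEGORIES for prefix in ("B", "I")]

print(len(CATEGORIES))   # 17 categories
print(len(iob2_labels))  # 2 * 17 + 1 = 35 labels
```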

For details, see the paper Reducing Privacy Risks in Online Self-Disclosures with Language Models.

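By accessing this model, you automatically agree to the following guidelines:

Use this model for research purposes only.
Do not redistribute it without the authors' consent.
Any derivative work created using this model must credit the original authors.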

✨ Key Features

Detects self-disclosures (personal information) in sentences.
Covers 17 categories of self-disclosure.

💻 Usage Example

Basic usage

import torch
from torch.utils.data import DataLoader, Dataset

from transformers import AutoModelForTokenClassification, AutoTokenizer, AutoConfig, DataCollatorForTokenClassification

model_path = "douy/deberta-v3-large-self-disclosure-detection"

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

config = AutoConfig.from_pretrained(model_path)

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

model = AutoModelForTokenClassification.from_pretrained(model_path, config=config).to(device).eval()

# Label mappings stored in the model config
label2id = config.label2id
id2label = config.id2label


def tokenize_and_align_labels(words):
    tokenized_inputs = tokenizer(
        words,
        padding=False,
        is_split_into_words=True,
    )

    # No gold labels exist at inference time, so every word gets a placeholder "O" label.
    # The labels only mark which token positions to read predictions from later.
    word_ids = tokenized_inputs.word_ids(0)
    previous_word_idx = None
    label_ids = []
    for word_idx in word_ids:
        # Special tokens have a word id of None; label them -100 so they are skipped.
        if word_idx is None:
            label_ids.append(-100)
        # The first token of each word carries the word's label.
        elif word_idx != previous_word_idx:
            label_ids.append(label2id["O"])
        # Remaining sub-tokens of the same word are labeled -100 and skipped.
        else:
            label_ids.append(-100)
        previous_word_idx = word_idx
    tokenized_inputs["labels"] = label_ids
    return tokenized_inputs

class DisclosureDataset(Dataset):
    def __init__(self, inputs, tokenizer, tokenize_and_align_labels_function):
        self.inputs = inputs
        self.tokenizer = tokenizer
        self.tokenize_and_align_labels_function = tokenize_and_align_labels_function

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        words = self.inputs[idx]
        tokenized_inputs = self.tokenize_and_align_labels_function(words)
        return tokenized_inputs
    
    
sentences = [
    "I am a 23-year-old who is currently going through the last leg of undergraduate school.",
    "My husband and I live in US.",
]

inputs = [sentence.split() for sentence in sentences]

data_collator = DataCollatorForTokenClassification(tokenizer)

dataset = DisclosureDataset(inputs, tokenizer, tokenize_and_align_labels)

dataloader = DataLoader(dataset, collate_fn=data_collator, batch_size=2)

total_predictions = []
for step, batch in enumerate(dataloader):
    batch = {k: v.to(model.device) for k, v in batch.items()}
    with torch.inference_mode():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(-1)
    labels = batch["labels"]

    predictions = predictions.cpu().tolist()
    labels = labels.cpu().tolist()

    true_predictions = []
    for i, label in enumerate(labels):
        true_pred = []
        for j, m in enumerate(label):
            if m != -100:
                true_pred.append(id2label[predictions[i][j]])
        true_predictions.append(true_pred)
    total_predictions.extend(true_predictions)
    

for word, pred in zip(inputs, total_predictions):
    for w, p in zip(word, pred):
        print(w, p)
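
The `-100` masking in `tokenize_and_align_labels` can be illustrated without loading the model. The sketch below is an assumption-laden toy: the helper name `first_subtoken_mask` is hypothetical, and the `word_ids` and token predictions are made up for illustration, not real tokenizer or model output.

```python
from typing import List, Optional


def first_subtoken_mask(word_ids: List[Optional[int]]) -> List[bool]:
    """True at positions holding the first sub-token of a word, False elsewhere."""
    mask = []
    previous = None
    for word_id in word_ids:
        # Special tokens (None) and continuation sub-tokens are masked out.
        mask.append(word_id is not None and word_id != previous)
        previous = word_id
    return mask


words = ["I", "am", "a", "23-year-old"]

# Hypothetical word ids: one special token at each end, and the last word
# split into five sub-tokens (real tokenization may differ).
word_ids = [None, 0, 1, 2, 3, 3, 3, 3, 3, None]

# Hypothetical per-token predictions, one per entry in word_ids.
token_preds = ["O", "O", "O", "O", "B-Age", "I-Age", "I-Age", "I-Age", "I-Age", "O"]

mask = first_subtoken_mask(word_ids)
word_preds = [pred for pred, keep in zip(token_preds, mask) if keep]

for word, pred in zip(words, word_preds):
    print(word, pred)
```

Reading only the first sub-token of each word keeps exactly one prediction per input word, which is why the final loop in the usage example can zip words and predictions directly.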

📚 Documentation

Model description

Attribute	Details
Model type	Fine-tuned model that detects 17 categories of self-disclosure
Language	English
License	Creative Commons Attribution-NonCommercial
Fine-tuned from	microsoft/deberta-v3-large

Evaluation

The model achieves a partial span F1 of 65.71, outperforming GPT-4 prompting (F1 of 57.68). See the paper for detailed per-category performance.

📄 License

This model is provided under the Creative Commons Attribution-NonCommercial license, for research use only.

Citation

@article{dou2023reducing,
  title={Reducing Privacy Risks in Online Self-Disclosures with Language Models},
  author={Dou, Yao and Krsek, Isadora and Naous, Tarek and Kabra, Anubha and Das, Sauvik and Ritter, Alan and Xu, Wei},
  journal={arXiv preprint arXiv:2311.09538},
  year={2023}
}