DeBERTa-v3-large Open-Source Model for Self-Disclosure Detection - Identifies 17 Categories of Personal Information

DeBERTa V3 Large Self-Disclosure Detection

Developed by douy

A model for detecting self-disclosure (personal information) in sentences, supporting recognition of 17 types of personal information

Sequence Labeling

Transformers

English #Personal Information Detection #Privacy Protection #Multi-category Tag Classification


Release Time: 5/12/2024

Model Overview

This model is fine-tuned from DeBERTa-v3-large to detect disclosures of personal information in text. It treats the problem as multi-class token classification, similar to named entity recognition.

Model Features

Multi-category Recognition

Can identify 17 types of personal information, including age, gender, occupation, location, etc.

Strong Detection Performance

Achieves a partial span F1 of 65.71, compared with 57.68 for prompting GPT-4

Research-Only

Model usage is restricted to research purposes and must comply with strict usage guidelines

Model Capabilities

Text Token Classification

Personal Information Identification

Privacy Risk Detection

Use Cases

Privacy Protection

Social Media Content Analysis

Detect personal information unintentionally disclosed by users on social media

Identify potential privacy risk points

Privacy Compliance Check

Inspect user-generated content for sensitive information (subject to the research-only access guidelines stated in this card)

Help comply with data protection regulations

Academic Research

Online Behavior Research

Analyze users' self-disclosure patterns on the internet

Provide data support for psychological and sociological studies

🚀 DeBERTa-v3-Large Self-Disclosure Detection

This model is designed to detect self-disclosures (personal information) in sentences. It addresses a multi-class token classification task similar to NER in IOB2 format, offering a practical solution for privacy-related information identification.
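To make the IOB2 format concrete, here is a minimal sketch of how word-level spans are encoded as tags: the first word of a span gets `B-<Category>`, following words get `I-<Category>`, and everything else gets `O`. The helper `spans_to_iob2` and the example annotation are hypothetical illustrations, not taken from the model or its training data.

```python
def spans_to_iob2(num_words, spans):
    """Encode (category, start, end) word spans as IOB2 tags; `end` is exclusive."""
    tags = ["O"] * num_words
    for category, start, end in spans:
        tags[start] = f"B-{category}"      # first word of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"      # remaining words of the span
    return tags


words = "I am a 23-year-old student".split()
# Hypothetical annotation for illustration only.
tags = spans_to_iob2(len(words), [("Age", 2, 4)])
for word, tag in zip(words, tags):
    print(word, tag)
```

The model predicts one such tag per word, so its output can be read back into spans by reversing this encoding.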

🚀 Quick Start

The deberta-v3-large-self-disclosure-detection model tags each word of a sentence with a self-disclosure label. To start using it, follow the example code below.

✨ Features

Multi-Category Detection: Capable of detecting self-disclosures in 17 categories, including "Age", "Age_Gender", "Appearance", etc.
High Performance: Achieves a 65.71 partial span F1, outperforming prompting GPT-4 (57.68 F1).
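Because the task uses IOB2 tags, the category inventory can be read off the model's label names. The following is a minimal sketch, assuming the labels follow the usual `B-<Category>` / `I-<Category>` / `O` naming (an assumption; check `config.id2label` on the actual checkpoint). The `example_id2label` dict is a hypothetical subset for illustration, not the model's real label map.

```python
def categories_from_labels(id2label):
    """Collect the distinct category names from IOB2 label strings."""
    categories = set()
    for label in id2label.values():
        # IOB2 labels look like "B-Age" or "I-Age"; "O" marks non-disclosure words.
        prefix, _, category = label.partition("-")
        if prefix in ("B", "I") and category:
            categories.add(category)
    return sorted(categories)


# Hypothetical subset of a label map, for illustration only.
example_id2label = {
    0: "O",
    1: "B-Age", 2: "I-Age",
    3: "B-Age_Gender", 4: "I-Age_Gender",
    5: "B-Appearance", 6: "I-Appearance",
}

print(categories_from_labels(example_id2label))
```

With the real checkpoint, passing `config.id2label` from the usage example should list the categories the model was trained on.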

📦 Installation

The model itself needs no separate installation step. To run the example code, install the required Python libraries (torch, datasets, and transformers) with pip:

pip install torch datasets transformers

💻 Usage Examples

Basic Usage

import torch
from torch.utils.data import DataLoader, Dataset

from transformers import AutoModelForTokenClassification, AutoTokenizer, AutoConfig, DataCollatorForTokenClassification

model_path = "douy/deberta-v3-large-self-disclosure-detection"

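config = AutoConfig.from_pretrained(model_path)

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = AutoModelForTokenClassification.from_pretrained(model_path, config=config).to(device).eval()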

label2id = config.label2id
id2label = config.id2label


def tokenize_and_align_labels(words):
    tokenized_inputs = tokenizer(
                words,
                padding=False,
                is_split_into_words=True,
            )

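    # No gold labels exist at inference time, so every word gets the placeholder label "O".
    # These labels are only used later to pick out the first sub-token of each word.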
    word_ids = tokenized_inputs.word_ids(0)
    previous_word_idx = None
    label_ids = []
    for word_idx in word_ids:
        # Special tokens have a word id that is None. We set the label to -100 so they are automatically
        # ignored in the loss function.
        if word_idx is None:
            label_ids.append(-100)
        # We set the label for the first token of each word.
        elif word_idx != previous_word_idx:
            label_ids.append(label2id["O"])
        # For the other tokens in a word, we set the label to -100
        else:
            label_ids.append(-100)
        previous_word_idx = word_idx
    tokenized_inputs["labels"] = label_ids
    return tokenized_inputs

class DisclosureDataset(Dataset):
    def __init__(self, inputs, tokenizer, tokenize_and_align_labels_function):
        self.inputs = inputs
        self.tokenizer = tokenizer
        self.tokenize_and_align_labels_function = tokenize_and_align_labels_function

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        words = self.inputs[idx]
        tokenized_inputs = self.tokenize_and_align_labels_function(words)
        return tokenized_inputs
    
    
sentences = [
    "I am a 23-year-old who is currently going through the last leg of undergraduate school.",
    "My husband and I live in US.",
]

inputs = [sentence.split() for sentence in sentences]

data_collator = DataCollatorForTokenClassification(tokenizer)

dataset = DisclosureDataset(inputs, tokenizer, tokenize_and_align_labels)

dataloader = DataLoader(dataset, collate_fn=data_collator, batch_size=2)

total_predictions = []
for step, batch in enumerate(dataloader):
    batch = {k: v.to(model.device) for k, v in batch.items()}
    with torch.inference_mode():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(-1)
    labels = batch["labels"]

    predictions = predictions.cpu().tolist()
    labels = labels.cpu().tolist()

    true_predictions = []
    for i, label in enumerate(labels):
        true_pred = []
        for j, m in enumerate(label):
            if m != -100:
                true_pred.append(id2label[predictions[i][j]])
        true_predictions.append(true_pred)
    total_predictions.extend(true_predictions)
    

for word, pred in zip(inputs, total_predictions):
    for w, p in zip(word, pred):
        print(w, p)
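The loop above prints one IOB2 tag per word. For downstream use it is often more convenient to group those tags into spans. The following is a minimal sketch, assuming tags of the form `B-<Category>`, `I-<Category>`, and `O`; the helper `iob2_to_spans` and the example tags are hypothetical and are not actual model output.

```python
def iob2_to_spans(words, tags):
    """Group word-level IOB2 tags into (category, start, end, text) spans.

    `end` is exclusive. An "I-X" tag that does not continue an open "X" span
    is treated as the start of a new span (a lenient choice, not part of IOB2).
    """
    spans = []
    current = None  # (category, start_index) of the span being built

    def close(end):
        category, start = current
        spans.append((category, start, end, " ".join(words[start:end])))

    for i, tag in enumerate(tags):
        prefix, _, category = tag.partition("-")
        if prefix == "I" and current is not None and current[0] == category:
            continue  # the open span simply grows by one word
        if current is not None:
            close(i)  # any other tag ends the open span
            current = None
        if prefix in ("B", "I") and category:
            current = (category, i)
    if current is not None:
        close(len(tags))
    return spans


# Hypothetical tags for illustration; real tags come from the model's predictions.
example_words = "I am a 23-year-old student".split()
example_tags = ["O", "O", "B-Age", "I-Age", "O"]
print(iob2_to_spans(example_words, example_tags))
```

With the variables from the example above, the same helper can be applied per sentence: `for word, pred in zip(inputs, total_predictions): print(iob2_to_spans(word, pred))`.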

📚 Documentation

Model Description

Property	Details
Model Type	A finetuned model that can detect self-disclosures in 17 categories.
Language(s) (NLP)	English
License	Creative Commons Attribution-NonCommercial
Finetuned from model	[microsoft/deberta-v3-large](https://huggingface.co/microsoft/deberta-v3-large)

Access Guidelines

⚠️ Important Note

Only use the model for research purposes.

No redistribution without the author's agreement.

Any derivative works created using this model must acknowledge the original author.

Evaluation

The model achieves 65.71 partial span F1, better than prompting GPT-4 (57.68 F1). For detailed per-category performance, see the paper "Reducing Privacy Risks in Online Self-Disclosures with Language Models".
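This card does not restate how a partial match is counted; the paper gives the exact definition. As a rough illustration of the idea only, the sketch below computes a span-level F1 in which a span counts as matched when a same-category span overlaps it sufficiently. The matching rule (overlap divided by the longer span's length) and the 0.5 threshold are assumptions made for this example, and the function names are hypothetical.

```python
def overlap_ratio(a, b):
    """Overlap length divided by the longer span's length.

    Spans are (category, start, end) with exclusive `end` and end > start.
    """
    inter = min(a[2], b[2]) - max(a[1], b[1])
    longer = max(a[2] - a[1], b[2] - b[1])
    return max(inter, 0) / longer


def partial_span_f1(pred_spans, gold_spans, threshold=0.5):
    """F1 where a span is matched if a same-category span overlaps it enough."""
    def matched(span, others):
        return any(
            span[0] == o[0] and overlap_ratio(span, o) > threshold
            for o in others
        )

    if not pred_spans or not gold_spans:
        return 0.0
    precision = sum(matched(p, gold_spans) for p in pred_spans) / len(pred_spans)
    recall = sum(matched(g, pred_spans) for g in gold_spans) / len(gold_spans)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans for illustration; not real annotations or model output.
gold = [("Age", 2, 4), ("Location", 6, 7)]
pred = [("Age", 2, 5), ("Occupation", 6, 7)]
print(partial_span_f1(pred, gold))
```

Here the predicted "Age" span overlaps the gold span enough to count, while the second prediction has the wrong category, so precision and recall are both one half.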

📄 License

The model is licensed under the Creative Commons Attribution-NonCommercial license.

📚 Citation

@article{dou2023reducing,
  title={Reducing Privacy Risks in Online Self-Disclosures with Language Models},
  author={Dou, Yao and Krsek, Isadora and Naous, Tarek and Kabra, Anubha and Das, Sauvik and Ritter, Alan and Xu, Wei},
  journal={arXiv preprint arXiv:2311.09538},
  year={2023}
}
