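🚀 deberta-v3-large-self-disclosure-detection
This model detects self-disclosures (personal information) in sentences. The task is multi-class token classification, similar to named entity recognition (NER) with IOB2 tagging. It can support privacy protection and the identification of sensitive information.
🚀 Quick Start
Each word in the input receives one IOB2 tag. For example, the sentence "I am 22 years old and ..." is labeled ["B-Age", "I-Age", "I-Age", "I-Age", "I-Age", "O", ...].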
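To make the IOB2 scheme concrete, here is a minimal sketch that turns word-level spans into tags and reproduces the example above. The function name `spans_to_iob2` and the `(start, end, category)` span layout are illustrative assumptions, not part of the model's code.

```python
def spans_to_iob2(num_words, spans):
    """Convert (start, end, category) word spans into IOB2 tags.

    `start` is inclusive and `end` is exclusive, both counted in words.
    This helper is hypothetical and only illustrates the tagging scheme.
    """
    tags = ["O"] * num_words
    for start, end, category in spans:
        # The first word of a span gets the "B-" (begin) prefix.
        tags[start] = f"B-{category}"
        # Every following word of the same span gets the "I-" (inside) prefix.
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"
    return tags


# Only the visible part of the example sentence is used here.
example_words = ["I", "am", "22", "years", "old", "and"]
example_tags = spans_to_iob2(len(example_words), [(0, 5, "Age")])
print(example_tags)
# ['B-Age', 'I-Age', 'I-Age', 'I-Age', 'I-Age', 'O']
```

Words outside any span keep the "O" (outside) tag, which is why the final word "and" is labeled "O".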
The model detects the following 17 categories: "Age", "Age_Gender" (age and gender), "Appearance", "Education", "Family", "Finance", "Gender", "Health", "Husband_BF" (husband or boyfriend), "Location", "Mental_Health", "Occupation", "Pet", "Race_Nationality", "Relationship_Status", "Sexual_Orientation", and "Wife_GF" (wife or girlfriend).
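With IOB2 tagging, each category contributes a "B-" and an "I-" tag, plus one shared "O" tag. The sketch below builds such a label set from the 17 category names. The ordering and the names `build_iob2_labels` and `sketch_label2id` are assumptions for illustration; the authoritative mapping is whatever `config.label2id` contains in the released model.

```python
CATEGORIES = [
    "Age", "Age_Gender", "Appearance", "Education", "Family", "Finance",
    "Gender", "Health", "Husband_BF", "Location", "Mental_Health",
    "Occupation", "Pet", "Race_Nationality", "Relationship_Status",
    "Sexual_Orientation", "Wife_GF",
]


def build_iob2_labels(categories):
    """Return an IOB2 label list: "O" plus B-/I- tags for every category.

    The ordering here is an assumption and may differ from the model's config.
    """
    labels = ["O"]
    for category in categories:
        labels.append(f"B-{category}")
        labels.append(f"I-{category}")
    return labels


sketch_labels = build_iob2_labels(CATEGORIES)
sketch_label2id = {label: idx for idx, label in enumerate(sketch_labels)}
sketch_id2label = {idx: label for label, idx in sketch_label2id.items()}

print(len(sketch_labels))  # 35 = 2 * 17 + 1
```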
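For more details, see the paper: Reducing Privacy Risks in Online Self-Disclosures with Language Models.
By using this model, you automatically agree to the following guidelines:
- Use the model for research purposes only.
- Do not redistribute it without the authors' consent.
- Any derivative work created with this model must acknowledge the original authors.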
✨ Key Features
- Detects self-disclosures in sentences across 17 categories.
- Achieves a partial span F1 of 65.71, outperforming prompted GPT-4 (partial span F1 of 57.68).
💻 Usage Example
Basic Usage
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import (
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
)

model_path = "douy/deberta-v3-large-self-disclosure-detection"

# Load the config, tokenizer, and model. device_map="cuda:0" places the model on the
# first GPU and requires the accelerate package.
config = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModelForTokenClassification.from_pretrained(
    model_path, config=config, device_map="cuda:0"
).eval()

label2id = config.label2id
id2label = config.id2label

def tokenize_and_align_labels(words):
    tokenized_inputs = tokenizer(
        words,
        padding=False,
        is_split_into_words=True,
    )

    # No gold labels exist at inference time, so "O" serves as a placeholder label.
    # Only the -100 mask matters: it marks the token positions to skip later.
    word_ids = tokenized_inputs.word_ids(0)
    previous_word_idx = None
    label_ids = []
    for word_idx in word_ids:
        # Special tokens have a word id that is None. We set the label to -100 so they
        # are automatically ignored in the loss function.
        if word_idx is None:
            label_ids.append(-100)
        # We set the label for the first token of each word.
        elif word_idx != previous_word_idx:
            label_ids.append(label2id["O"])
        # For the other tokens in a word, we set the label to -100.
        else:
            label_ids.append(-100)
        previous_word_idx = word_idx

    tokenized_inputs["labels"] = label_ids
    return tokenized_inputs

class DisclosureDataset(Dataset):
    def __init__(self, inputs, tokenizer, tokenize_and_align_labels_function):
        self.inputs = inputs
        self.tokenizer = tokenizer
        self.tokenize_and_align_labels_function = tokenize_and_align_labels_function

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        words = self.inputs[idx]
        tokenized_inputs = self.tokenize_and_align_labels_function(words)
        return tokenized_inputs

sentences = [
    "I am a 23-year-old who is currently going through the last leg of undergraduate school.",
    "My husband and I live in US.",
]

# The model expects pre-split words, so split each sentence on whitespace.
inputs = [sentence.split() for sentence in sentences]

data_collator = DataCollatorForTokenClassification(tokenizer)
dataset = DisclosureDataset(inputs, tokenizer, tokenize_and_align_labels)
dataloader = DataLoader(dataset, collate_fn=data_collator, batch_size=2)

total_predictions = []
for step, batch in enumerate(dataloader):
    batch = {k: v.to(model.device) for k, v in batch.items()}
    with torch.inference_mode():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(-1)
    labels = batch["labels"]

    predictions = predictions.cpu().tolist()
    labels = labels.cpu().tolist()

    # Keep one prediction per word: skip positions masked with -100
    # (special tokens, padding, and non-initial subword tokens).
    true_predictions = []
    for i, label in enumerate(labels):
        true_pred = []
        for j, m in enumerate(label):
            if m != -100:
                true_pred.append(id2label[predictions[i][j]])
        true_predictions.append(true_pred)

    total_predictions.extend(true_predictions)

# Print each word next to its predicted IOB2 tag.
for word, pred in zip(inputs, total_predictions):
    for w, p in zip(word, pred):
        print(w, p)
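The loop above prints one tag per word. To work with whole disclosures rather than individual tags, the word-level tags can be grouped into spans. The following is a minimal sketch under the assumption that tags follow the `B-Category` / `I-Category` / `O` naming shown earlier; `iob2_to_spans` is a hypothetical helper, not part of the released code.

```python
def iob2_to_spans(words, tags):
    """Group word-level IOB2 tags into (category, text) spans."""
    spans = []
    current_category = None
    current_words = []
    for word, tag in zip(words, tags):
        # "B-Age" -> ("B", "-", "Age"); "O" -> ("O", "", "")
        prefix, _, category = tag.partition("-")
        if prefix == "I" and category == current_category:
            # Continue the span that is currently open.
            current_words.append(word)
            continue
        # Any other tag closes the open span, if there is one.
        if current_category is not None:
            spans.append((current_category, " ".join(current_words)))
            current_category, current_words = None, []
        if prefix in ("B", "I"):
            # A stray "I-" tag without a matching "B-" is leniently treated as a span start.
            current_category, current_words = category, [word]
    if current_category is not None:
        spans.append((current_category, " ".join(current_words)))
    return spans


# Tags taken from the labeling example in the Quick Start section.
demo_words = ["I", "am", "22", "years", "old", "and"]
demo_tags = ["B-Age", "I-Age", "I-Age", "I-Age", "I-Age", "O"]
print(iob2_to_spans(demo_words, demo_tags))
# [('Age', 'I am 22 years old')]
```

With the variables from the usage example, the same helper could be applied as `iob2_to_spans(word, pred)` for each pair in `zip(inputs, total_predictions)`.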
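📚 Documentation
Model Description
| Property | Details |
|---|---|
| Model type | A fine-tuned model that detects self-disclosures in 17 categories |
| Language (NLP) | English |
| License | Creative Commons Attribution-NonCommercial |
| Fine-tuned from | microsoft/deberta-v3-large |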
Evaluation
The model achieves a partial span F1 of 65.71, outperforming prompted GPT-4 (partial span F1 of 57.68). For per-category performance, see the paper.
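This card does not define how partial span F1 is computed, so the sketch below shows only one plausible variant for intuition. It assumes that a predicted span matches a gold span when both share a category and their overlap exceeds half the length of the longer span. That rule, the 0.5 threshold, and the function names are assumptions; the paper gives the exact metric behind the reported scores.

```python
def overlap_ratio(a, b):
    """Overlap length divided by the length of the longer span.

    Spans are (start, end, category) with an exclusive `end`, and are assumed non-empty.
    """
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    longer = max(a[1] - a[0], b[1] - b[0])
    return max(overlap, 0) / longer


def partial_span_f1(gold, pred, threshold=0.5):
    """F1 under an assumed partial-match rule (illustrative, not the paper's code)."""

    def matches(x, y):
        return x[2] == y[2] and overlap_ratio(x, y) > threshold

    # Precision counts predicted spans that match some gold span;
    # recall counts gold spans that are matched by some predicted span.
    matched_pred = sum(any(matches(p, g) for g in gold) for p in pred)
    matched_gold = sum(any(matches(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hand-written toy spans, not model output.
toy_gold = [(0, 5, "Age"), (8, 10, "Location")]
toy_pred = [(1, 5, "Age"), (8, 9, "Location"), (12, 13, "Pet")]
toy_score = partial_span_f1(toy_gold, toy_pred)
```

In this toy case only the "Age" prediction counts as a match (overlap 4 of 5 words), so precision is 1/3 and recall is 1/2. The "Location" prediction covers exactly half of the gold span and therefore falls short of the strict threshold.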
📄 License
This model is released under the Creative Commons Attribution-NonCommercial license.
📚 Citation
@article{dou2023reducing,
title={Reducing Privacy Risks in Online Self-Disclosures with Language Models},
author={Dou, Yao and Krsek, Isadora and Naous, Tarek and Kabra, Anubha and Das, Sauvik and Ritter, Alan and Xu, Wei},
journal={arXiv preprint arXiv:2311.09538},
year={2023}
}








