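🚀 PII Detection Model - Fine-Tuned Phi3 Mini
This repository contains a version of the Phi3 Mini model fine-tuned to detect personally identifiable information (PII). The model is trained to identify a range of PII entities in text, which makes it useful for data redaction, privacy protection, and compliance with data protection regulations.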
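🚀 Quick Start
The model detects personally identifiable information (PII) in text, for use cases such as data privacy protection. The sections below cover installation and usage.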
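✨ Key Features
- Entity recognition: identifies many types of PII entities across categories such as personal information, contact details, and address information.
- Broad applicability: suited to data redaction, privacy protection, and data protection compliance tasks.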
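📦 Installation
To use this model, install the transformers library. The usage example also needs PyTorch, because it requests "pt" tensors from the tokenizer:
pip install transformers torch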
💻 Usage Example
Basic usage
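import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Run on a GPU when one is available; the tokenized inputs are moved to this device later
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("ab-ai/PII-Model-Phi3-Mini")
# generate() needs a causal language-model head, not a token-classification head
model = AutoModelForCausalLM.from_pretrained("ab-ai/PII-Model-Phi3-Mini").to(device)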
input_text = "Hi Abner, just a reminder that your next primary care appointment is on 23/03/1926. Please confirm by replying to this email Nathen15@hotmail.com."
model_prompt = f"""### Instruction:
Identify and extract the following PII entities from the text, if present: companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency, eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic, currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn, ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename, ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height, password, ip, useragent, accountname, city, gender, secondaryaddress, iban, sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv, county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber. Return the output in JSON format.
### Input:
{input_text}
### Output: """
inputs = tokenizer(model_prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=120)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
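The decoded response contains the prompt followed by the generated text, so the JSON has to be separated out before it can be used. The following is a minimal sketch, assuming the model emits a JSON object after the "### Output:" marker. The helper name extract_pii_json is hypothetical and is not part of the model or of transformers.

```python
import json


def extract_pii_json(response: str, marker: str = "### Output:") -> dict:
    """Return the first JSON object found after the output marker, or {} if none parses.

    Assumption: the generated text contains a JSON object somewhere after `marker`.
    """
    # Keep only the text that follows the last occurrence of the marker.
    generated = response.rsplit(marker, 1)[-1]
    decoder = json.JSONDecoder()
    # Try each opening brace in turn until one starts a valid JSON object.
    start = generated.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(generated, start)
            return obj
        except json.JSONDecodeError:
            start = generated.find("{", start + 1)
    return {}
```

Calling extract_pii_json(response) on the decoded output yields a dictionary when the generation is well formed and an empty dictionary otherwise, which lets the caller decide whether to retry.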
📚 Documentation
Model Overview
Model Architecture
Detectable PII Entities
The model can detect the following PII entities:
- Personal information:
  - firstname (first name)
  - middlename (middle name)
  - lastname (last name)
  - sex (sex)
  - dob (date of birth)
  - age (age)
  - gender (gender)
  - height (height)
  - eyecolor (eye color)
- Contact information:
  - email (email address)
  - phonenumber (phone number)
  - url (URL)
  - username (username)
  - useragent (user agent)
- Address information:
  - street (street)
  - city (city)
  - state (state)
  - county (county)
  - zipcode (ZIP code)
  - country (country)
  - secondaryaddress (secondary address)
  - buildingnumber (building number)
  - ordinaldirection (ordinal direction)
- Geographic information:
  - nearbygpscoordinate (nearby GPS coordinate)
- Organization information:
  - companyname (company name)
  - jobtitle (job title)
  - jobarea (job area)
  - jobtype (job type)
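- Financial information:
  - accountname (account name)
  - accountnumber (account number)
  - creditcardnumber (credit card number)
  - creditcardcvv (credit card CVV)
  - creditcardissuer (credit card issuer)
  - iban (International Bank Account Number)
  - bic (Bank Identifier Code)
  - currency (currency)
  - currencyname (currency name)
  - currencysymbol (currency symbol)
  - currencycode (currency code)
  - amount (amount)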
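- Unique identifiers:
  - pin (personal identification number)
  - ssn (Social Security number)
  - imei (phone IMEI number; appears as phoneimei in the prompt)
  - mac (MAC address)
  - vehiclevin (vehicle identification number)
  - vehiclevrm (vehicle registration mark)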
- Cryptocurrency information:
  - bitcoinaddress (Bitcoin address)
  - litecoinaddress (Litecoin address)
  - ethereumaddress (Ethereum address)
- Other information:
  - ip (IP address)
  - ipv4 (IPv4 address)
  - ipv6 (IPv6 address)
  - maskednumber (masked number)
  - password (password)
  - time (time)
  - prefix (prefix)
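Once entities have been extracted, a typical redaction step replaces each detected value with a placeholder named after its label. The sketch below assumes the extracted result is a dictionary that maps each entity label to a string or a list of strings. That schema is an assumption rather than something this document specifies, and the function name redact is hypothetical.

```python
import re


def redact(text: str, entities: dict) -> str:
    """Replace every detected value in `text` with a [LABEL] placeholder.

    Assumption: `entities` maps a label to one value or to a list of values.
    """
    # Build a value -> label lookup, skipping missing and empty values.
    lookup = {}
    for label, value in entities.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            if v is None or str(v) == "":
                continue
            lookup.setdefault(str(v), label)
    if not lookup:
        return text
    # Match longer values first so a short value cannot split a longer one.
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in ordered))
    # A single pass avoids re-matching inside placeholders already inserted.
    return pattern.sub(lambda m: f"[{lookup[m.group(0)].upper()}]", text)


# Hand-written entity dictionary for illustration; this is not real model output.
example = redact(
    "Hi Abner, please reply to Nathen15@hotmail.com.",
    {"firstname": "Abner", "email": "Nathen15@hotmail.com"},
)
print(example)
```

Doing the replacement in one regular-expression pass is a design choice: chained str.replace calls could match text inside placeholders that an earlier call inserted.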
Prompt Format
### Instruction:
Identify and extract the following PII entities from the text, if present: companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency, eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic, currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn, ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename, ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height, password, ip, useragent, accountname, city, gender, secondaryaddress, iban, sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv, county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber. Return the output in JSON format.
### Input:
Greetings, Mason! Let's celebrate another year of wellness on 14/01/1977. Don't miss the event at 176,Apt. 388.
### Output:
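Writing the prompt by hand for every input is error-prone, so it can help to assemble it in one place. The sketch below follows the Instruction / Input / Output layout used in the usage example. The names ENTITY_LABELS and build_prompt are hypothetical, and whether the model behaves well with a shortened label list is untested here.

```python
# Labels copied from the instruction text, in the same order.
ENTITY_LABELS = (
    "companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency, "
    "eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic, "
    "currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn, "
    "ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename, "
    "ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height, "
    "password, ip, useragent, accountname, city, gender, secondaryaddress, iban, "
    "sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv, "
    "county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber"
).split(", ")


def build_prompt(input_text: str, labels=ENTITY_LABELS) -> str:
    """Assemble the instruction prompt for one input text."""
    instruction = (
        "Identify and extract the following PII entities from the text, if present: "
        + ", ".join(labels)
        + ". Return the output in JSON format."
    )
    # Same layout as the usage example: instruction, input, then an open output slot.
    return f"### Instruction:\n{instruction}\n### Input:\n{input_text}\n### Output: "
```

The returned string can be passed to the tokenizer in place of the hand-written model_prompt from the usage example.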
📄 License
This project is released under the MIT License.