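Model Overview

Fine-tuned on the world's largest open-source privacy dataset, the model can identify 54 types of sensitive information. It is suited to privacy protection in AI-assistant and LLM scenarios.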

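Model Features

Broad PII detection

Identifies 54 sensitive data types, including financial information, identity documents, and contact details.

Efficient, lightweight model

Uses the DistilBERT architecture to reduce compute requirements while maintaining high accuracy.

Applicable to diverse scenarios

The training data covers 229 discussion topics and 5 interaction styles, so the model applies to a wide range of text scenarios.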

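Model Capabilities

Detecting sensitive information in text

Recognizing personally identifiable information

Classifying privacy data

Multi-category entity recognition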

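Use Cases

Privacy protection

Masking AI chat logs

Automatically identifies and masks sensitive information in chat logs.

Overall F1 score on the evaluation set: 0.9549

Document privacy review

Scans documents for personally identifiable information to support compliance with privacy regulations such as GDPR.

Email address detection F1 score: 1.0

Data security

Log anonymization

Automatically removes sensitive data from system logs.

IP address detection F1 score: 0.4349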
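
Because the reported IP-address score is low, a log pipeline may want a rule-based pass alongside the model. The sketch below shows one such fallback using Python's standard `ipaddress` module. It is not part of the model or its repository, and the function name `redact_ips` and the `[IP]` placeholder are hypothetical.

```python
import ipaddress
import re

# Candidate tokens: runs of hex digits, dots and colons (IPv4- and IPv6-like shapes).
_CANDIDATE = re.compile(r"[0-9A-Fa-f:.]{3,}")


def redact_ips(line, placeholder="[IP]"):
    """Replace tokens that parse as valid IPv4 or IPv6 addresses."""
    def _sub(match):
        raw = match.group(0)
        token = raw.rstrip(".:")  # drop trailing sentence punctuation
        try:
            ipaddress.ip_address(token)
        except ValueError:
            return raw  # not an address, keep the text unchanged
        return placeholder + raw[len(token):]
    return _CANDIDATE.sub(_sub, line)


print(redact_ips("GET /index from 192.168.0.12: ok, peer 2001:db8::1."))
```

Validating each candidate with `ipaddress.ip_address` keeps version strings and hex words from being redacted by mistake. Addresses that legitimately end in `::` are missed by this simple trailing-punctuation rule.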

---
license: cc-by-nc-4.0
base_model: distilbert-base-uncased
tags:
- generated_from_trainer
model-index:
- name: distilbert_finetuned_ai4privacy_v2
  results: []
datasets:
- ai4privacy/pii-masking-200k
- Isotonic/pii-masking-200k
pipeline_tag: token-classification
language:
- en
metrics:
- seqeval
---


distilbert_finetuned_ai4privacy_v2

This model is a fine-tuned version of distilbert-base-uncased on the English subset of the ai4privacy/pii-masking-200k dataset.

Usage

GitHub Implementation: Ai4Privacy

Model description

This model has been fine-tuned on the world's largest open-source privacy dataset.

The purpose of the trained models is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs.
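
As a minimal sketch of the removal step, assume the model is run through the Hugging Face `transformers` token-classification pipeline with `aggregation_strategy="simple"`, which returns dicts carrying `entity_group`, `start`, and `end`. The detected spans could then be replaced with bracketed labels. The helper name `mask_entities`, the placeholder format, and the sample entities are hypothetical. The model call appears only in a comment because the Hub repository id is not stated in this card.

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a bracketed label, e.g. [EMAIL]."""
    masked = text
    # Work right-to-left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = ent["entity_group"].upper()
        masked = masked[: ent["start"]] + f"[{label}]" + masked[ent["end"]:]
    return masked


# In practice `entities` would come from something like:
#   from transformers import pipeline
#   ner = pipeline("token-classification", model=<model id>,
#                  aggregation_strategy="simple")
#   entities = ner(text)
# The list below is hand-written for illustration, not real model output.
text = "Contact Jane at jane@example.com today."
entities = [
    {"entity_group": "FIRSTNAME", "start": 8, "end": 12},
    {"entity_group": "EMAIL", "start": 16, "end": 32},
]
print(mask_entities(text, entities))
```

The helper assumes the spans do not overlap. Overlapping predictions would need to be merged before masking.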

The example texts cover 54 PII classes (types of sensitive data), targeting 229 discussion subjects / use cases split across business, education, psychology and legal fields, and 5 interaction styles (e.g. casual conversation, formal document, emails).

Take a look at the GitHub implementation for specific research.

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine_with_restarts
lr_scheduler_warmup_ratio: 0.2
num_epochs: 5
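
To make the schedule concrete, the sketch below reimplements a cosine-with-restarts learning-rate curve with linear warmup in plain Python. The total of 5440 steps is taken from the training results table in this card. The number of cosine cycles is not stated, so `num_cycles=1` is an assumption, and the function name `lr_at` is hypothetical. It illustrates the shape of the curve and is not the trainer's exact code.

```python
import math

BASE_LR = 5e-5        # learning_rate from the card
TOTAL_STEPS = 5440    # final step in the training results table
WARMUP_RATIO = 0.2    # lr_scheduler_warmup_ratio from the card
WARMUP_STEPS = round(TOTAL_STEPS * WARMUP_RATIO)  # 1088 steps


def lr_at(step: int, num_cycles: int = 1) -> float:
    """Learning rate at a given step (num_cycles is an assumption)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 up to BASE_LR.
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    if progress >= 1.0:
        return 0.0
    # Cosine decay that jumps back to the peak at the start of each cycle.
    cycle_pos = (num_cycles * progress) % 1.0
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * cycle_pos))


for step in (0, 544, 1088, 3264, 5440):
    print(step, lr_at(step))
```

Under these assumptions the warmup lasts 1088 steps, which matches the length of the first epoch in the training results.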

Class-wise metrics

The model achieves the following results on the evaluation set:

Loss: 0.0451
Overall Precision: 0.9438
Overall Recall: 0.9663
Overall F1: 0.9549
Overall Accuracy: 0.9838
Accountname F1: 0.9946
Accountnumber F1: 0.9940
Age F1: 0.9624
Amount F1: 0.9643
Bic F1: 0.9929
Bitcoinaddress F1: 0.9948
Buildingnumber F1: 0.9845
City F1: 0.9955
Companyname F1: 0.9962
County F1: 0.9877
Creditcardcvv F1: 0.9643
Creditcardissuer F1: 0.9953
Creditcardnumber F1: 0.9793
Currency F1: 0.7811
Currencycode F1: 0.8850
Currencyname F1: 0.2281
Currencysymbol F1: 0.9562
Date F1: 0.9061
Dob F1: 0.7914
Email F1: 1.0
Ethereumaddress F1: 1.0
Eyecolor F1: 0.9837
Firstname F1: 0.9846
Gender F1: 0.9971
Height F1: 0.9910
Iban F1: 0.9906
Ip F1: 0.4349
Ipv4 F1: 0.8126
Ipv6 F1: 0.7679
Jobarea F1: 0.9880
Jobtitle F1: 0.9991
Jobtype F1: 0.9777
Lastname F1: 0.9684
Litecoinaddress F1: 0.9721
Mac F1: 1.0
Maskednumber F1: 0.9635
Middlename F1: 0.9330
Nearbygpscoordinate F1: 1.0
Ordinaldirection F1: 0.9910
Password F1: 1.0
Phoneimei F1: 0.9918
Phonenumber F1: 0.9962
Pin F1: 0.9477
Prefix F1: 0.9546
Secondaryaddress F1: 0.9892
Sex F1: 0.9876
Ssn F1: 0.9976
State F1: 0.9893
Street F1: 0.9873
Time F1: 0.9889
Url F1: 1.0
Useragent F1: 0.9953
Username F1: 0.9975
Vehiclevin F1: 1.0
Vehiclevrm F1: 1.0
Zipcode F1: 0.9873
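
The overall F1 is the harmonic mean of the overall precision and recall, and the per-class list makes the weak classes easy to single out. The short check below recomputes F1 from the reported precision and recall, then flags classes under a cutoff. The 0.90 cutoff and the variable names are illustrative assumptions, and only a subset of the classes above is copied in.

```python
precision, recall = 0.9438, 0.9663  # overall values reported above

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))

# A subset of the per-class F1 scores listed above.
class_f1 = {
    "Currency": 0.7811, "Currencycode": 0.8850, "Currencyname": 0.2281,
    "Dob": 0.7914, "Email": 1.0, "Ip": 0.4349, "Ipv4": 0.8126,
    "Ipv6": 0.7679, "Ssn": 0.9976,
}

THRESHOLD = 0.90  # illustrative cutoff, not from the card
# Classes under the cutoff, weakest first.
weak = sorted(
    (name for name, score in class_f1.items() if score < THRESHOLD),
    key=class_f1.get,
)
print(weak)
```

The recomputed value agrees with the reported overall F1 when rounded to four decimals. The flagged classes are the currency, date-of-birth, and IP-address categories, so outputs for those types deserve extra review.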

Training results

Training Loss	Epoch	Step	Validation Loss	Overall Precision	Overall Recall	Overall F1	Overall Accuracy	Accountname F1	Accountnumber F1	Age F1	Amount F1	Bic F1	Bitcoinaddress F1	Buildingnumber F1	City F1	Companyname F1	County F1	Creditcardcvv F1	Creditcardissuer F1	Creditcardnumber F1	Currency F1	Currencycode F1	Currencyname F1	Currencysymbol F1	Date F1	Dob F1	Email F1	Ethereumaddress F1	Eyecolor F1	Firstname F1	Gender F1	Height F1	Iban F1	Ip F1	Ipv4 F1	Ipv6 F1	Jobarea F1	Jobtitle F1	Jobtype F1	Lastname F1	Litecoinaddress F1	Mac F1	Maskednumber F1	Middlename F1	Nearbygpscoordinate F1	Ordinaldirection F1	Password F1	Phoneimei F1	Phonenumber F1	Pin F1	Prefix F1	Secondaryaddress F1	Sex F1	Ssn F1	State F1	Street F1	Time F1	Url F1	Useragent F1	Username F1	Vehiclevin F1	Vehiclevrm F1	Zipcode F1
0.6445	1.0	1088	0.3322	0.6449	0.7003	0.6714	0.8900	0.7607	0.8733	0.6576	0.1766	0.25	0.6783	0.3621	0.6005	0.6909	0.5586	0.0	0.2449	0.7095	0.2889	0.0	0.0	0.3902	0.7720	0.0	0.9862	0.8011	0.5088	0.7740	0.7118	0.5434	0.8088	0.0	0.8303	0.7562	0.5318	0.7294	0.4681	0.6779	0.0	0.8909	0.0	0.0107	0.9985	0.4000	0.7307	0.9057	0.8618	0.0	0.9127	0.8235	0.9211	0.8026	0.4656	0.6390	0.9383	0.9775	0.8868	0.8201	0.4526	0.0550	0.5368
0.222	2.0	2176	0.1259	0.8170	0.8747	0.8449	0.9478	0.9708	0.9813	0.7638	0.7427	0.7837	0.8908	0.8833	0.8747	0.9814	0.8749	0.7601	0.9777	0.8834	0.5372	0.4828	0.0056	0.7785	0.8149	0.3140	0.9956	0.9935	0.9101	0.9270	0.9450	0.9853	0.9253	0.0650	0.0084	0.7962	0.9013	0.9446	0.9203	0.8555	0.6885	1.0	0.7152	0.6442	1.0	0.9623	0.9349	0.9905	0.9782	0.7656	0.9324	0.9903	0.9736	0.9274	0.8520	0.9138	0.9678	0.9922	0.9893	0.9804	0.9646	0.8556	0.8385
0.1331	3.0	3264	0.0773	0.9133	0.9371	0.9250	0.9654	0.9822	0.9815	0.9196	0.8852	0.9718	0.9785	0.9215	0.9757	0.9935	0.9651	0.8742	0.9921	0.9438	0.7568	0.7710	0.0	0.8998	0.7895	0.6578	0.9994	1.0	0.9554	0.9525	0.9823	0.9910	0.9866	0.0435	0.8293	0.7824	0.9671	0.9794	0.9571	0.9447	0.9141	1.0	0.8825	0.7988	1.0	0.9797	0.9921	0.9932	0.9943	0.8726	0.9401	0.9860	0.9792	0.9928	0.9740	0.9604	0.9730	0.9983	0.9964	0.9959	0.9890	0.9774	0.9247
0.0847	4.0	4352	0.0503	0.9368	0.9614	0.9489	0.9789	0.9955	0.9949	0.9573	0.9480	0.9929	0.9846	0.9808	0.9927	0.9962	0.9811	0.9436	0.9953	0.9695	0.7826	0.8713	0.1653	0.9458	0.8782	0.7996	1.0	1.0	0.9809	0.9816	0.9941	0.9910	0.9906	0.3389	0.8364	0.7066	0.9862	1.0	0.9795	0.9637	0.9429	1.0	0.9438	0.9165	1.0	0.9864	1.0	0.9932	0.9962	0.9352	0.9483	0.9860	0.9866	0.9976	0.9884	0.9827	0.9881	1.0	0.9953	0.9975	0.9945	0.9915	0.9841
0.0557	5.0	5440	0.0451	0.9438	0.9663	0.9549	0.9838	0.9946	0.9940	0.9624	0.9643	0.9929	0.9948	0.9845	0.9955	0.9962	0.9877	0.9643	0.9953	0.9793	0.7811	0.8850	0.2281	0.9562	0.9061	0.7914	1.0	1.0	0.9837	0.9846	0.9971	0.9910	0.9906	0.4349	0.8126	0.7679	0.9880	0.9991	0.9777	0.9684	0.9721	1.0	0.9635	0.9330	1.0	0.9910	1.0	0.9918	0.9962	0.9477	0.9546	0.9892	0.9876	0.9976	0.9893	0.9873	0.9889	1.0	0.9953	0.9975	1.0	1.0	0.9873