🚀 Piiranha-v1: Protect your personal information!
Piiranha is a model trained to detect 17 types of Personally Identifiable Information (PII) across six languages. It can successfully catch 98.27% of PII tokens, with an overall classification accuracy of 99.44%. It's especially accurate at detecting passwords, emails (100%), phone numbers, and usernames.
🚀 Quick Start
Piiranha is released under the cc-by-nc-nd-4.0 license. The sections below describe the supported languages and PII types, the model's measured performance, and its limitations.
✨ Features
- Multilingual Support: Detects PII across six languages: English, Spanish, French, German, Italian, and Dutch.
- High Accuracy: Catches 98.27% of PII tokens with an overall classification accuracy of 99.44%.
- Comprehensive PII Detection: Identifies 17 types of PII, including account numbers, credit card numbers, and more.
📚 Documentation
Model Description
Piiranha is a fine-tuned version of microsoft/mdeberta-v3-base. The context length is 256 DeBERTa tokens. For longer text, split it into chunks of at most 256 tokens and run the model on each chunk.
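The splitting step can be sketched as follows. This is a minimal illustration, not the author's method: `split_into_chunks` and `count_tokens` are hypothetical names, and the default of one token per whitespace-separated word is only a stand-in. In practice you would pass a function that counts tokens with the model's own tokenizer, and leave some headroom for special tokens.

```python
from typing import Callable, List

MAX_TOKENS = 256  # context length stated in the model description


def split_into_chunks(
    text: str,
    max_tokens: int = MAX_TOKENS,
    count_tokens: Callable[[str], int] = lambda word: 1,  # stand-in: 1 token per word
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_tokens.

    A single word that exceeds max_tokens on its own still gets its own
    (oversized) chunk; handling that case is left to the caller.
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for word in text.split():
        cost = count_tokens(word)
        # Close the current chunk if adding this word would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(word)
        used += cost
    if current:
        chunks.append(" ".join(current))
    return chunks


print(split_into_chunks("a b c d e", max_tokens=2))  # ['a b', 'c d', 'e']
```

Splitting on word boundaries can still cut a multi-word entity (a street address, for example) across two chunks, so overlapping the chunks slightly may be worth considering.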
Supported languages: English, Spanish, French, German, Italian, Dutch
Supported PII types: Account Number, Building Number, City, Credit Card Number, Date of Birth, Driver's License, Email, First Name, Last Name, ID Card, Password, Social Security Number, Street Address, Tax Number, Phone Number, Username, Zipcode.
It achieves the following results on a test set of ~73,000 sentences containing PII:
- Accuracy: 99.44%
- Loss: 0.0173
- Precision: 93.16%
- Recall: 93.08%
- F1: 93.12%
Note that these metrics are computed over all eighteen categories (17 PII types plus one non-PII class), so they are lower than the corresponding metrics for binary PII vs. non-PII classification.
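As a quick consistency check, the reported F1 is the harmonic mean of the reported precision and recall. The helper name `f1_score` below is purely illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported above, in percent.
overall_f1 = f1_score(93.16, 93.08)
print(f"{overall_f1:.2f}")  # 93.12
```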
Performance by PII type
The per-type metrics below are lower than the overall accuracy of 99.44% because of class imbalance: most tokens are not PII, so correctly labeled non-PII tokens dominate the accuracy figure. For redaction purposes, however, the model is more useful than these results suggest. It sometimes mistakes one PII type for another but still recognizes the token as PII. For instance, it often confuses first names with last names, which is harmless for redaction because the name is flagged either way.
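The difference between typed and PII-vs-non-PII scoring can be illustrated with a toy example. The label sequences below are invented for illustration, and the use of `"O"` for non-PII tokens is an assumption about the label scheme; neither comes from the model's actual evaluation.

```python
from typing import List

NON_PII = "O"  # assumed label for non-PII tokens


def is_pii(label: str) -> bool:
    return label != NON_PII


def typed_recall(true: List[str], pred: List[str]) -> float:
    """Share of PII tokens whose exact type was predicted."""
    total = sum(is_pii(t) for t in true)
    exact = sum(is_pii(t) and t == p for t, p in zip(true, pred))
    return exact / total if total else 0.0


def binary_recall(true: List[str], pred: List[str]) -> float:
    """Share of PII tokens flagged as any kind of PII."""
    total = sum(is_pii(t) for t in true)
    caught = sum(is_pii(t) and is_pii(p) for t, p in zip(true, pred))
    return caught / total if total else 0.0


# Toy data: the surname is mislabeled as a given name but is still flagged.
true_labels = ["O", "GIVENNAME", "SURNAME", "O", "EMAIL"]
pred_labels = ["O", "GIVENNAME", "GIVENNAME", "O", "EMAIL"]

print(f"{typed_recall(true_labels, pred_labels):.2f}")   # 0.67
print(f"{binary_recall(true_labels, pred_labels):.2f}")  # 1.00
```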
| Entity | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| ACCOUNTNUM | 0.84 | 0.87 | 0.85 | 3575 |
| BUILDINGNUM | 0.92 | 0.90 | 0.91 | 3252 |
| CITY | 0.95 | 0.97 | 0.96 | 7270 |
| CREDITCARDNUMBER | 0.94 | 0.96 | 0.95 | 2308 |
| DATEOFBIRTH | 0.93 | 0.85 | 0.89 | 3389 |
| DRIVERLICENSENUM | 0.96 | 0.96 | 0.96 | 2244 |
| EMAIL | 1.00 | 1.00 | 1.00 | 6892 |
| GIVENNAME | 0.87 | 0.93 | 0.90 | 12150 |
| IDCARDNUM | 0.89 | 0.94 | 0.91 | 3700 |
| PASSWORD | 0.98 | 0.98 | 0.98 | 2387 |
| SOCIALNUM | 0.93 | 0.94 | 0.93 | 2709 |
| STREET | 0.97 | 0.95 | 0.96 | 3331 |
| SURNAME | 0.89 | 0.78 | 0.83 | 8267 |
| TAXNUM | 0.97 | 0.89 | 0.93 | 2322 |
| TELEPHONENUM | 0.99 | 1.00 | 0.99 | 5039 |
| USERNAME | 0.98 | 0.98 | 0.98 | 7680 |
| ZIPCODE | 0.94 | 0.97 | 0.95 | 3191 |
| micro avg | 0.93 | 0.93 | 0.93 | 79706 |
| macro avg | 0.94 | 0.93 | 0.93 | 79706 |
| weighted avg | 0.93 | 0.93 | 0.93 | 79706 |
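The summary rows can be reproduced from the per-entity rows. The sketch below recomputes the total support and the macro and weighted averages from the rounded values in the table, so results agree only to two decimals; the variable and function names are illustrative.

```python
# (entity, precision, recall, support), copied from the per-entity rows above
ROWS = [
    ("ACCOUNTNUM", 0.84, 0.87, 3575),
    ("BUILDINGNUM", 0.92, 0.90, 3252),
    ("CITY", 0.95, 0.97, 7270),
    ("CREDITCARDNUMBER", 0.94, 0.96, 2308),
    ("DATEOFBIRTH", 0.93, 0.85, 3389),
    ("DRIVERLICENSENUM", 0.96, 0.96, 2244),
    ("EMAIL", 1.00, 1.00, 6892),
    ("GIVENNAME", 0.87, 0.93, 12150),
    ("IDCARDNUM", 0.89, 0.94, 3700),
    ("PASSWORD", 0.98, 0.98, 2387),
    ("SOCIALNUM", 0.93, 0.94, 2709),
    ("STREET", 0.97, 0.95, 3331),
    ("SURNAME", 0.89, 0.78, 8267),
    ("TAXNUM", 0.97, 0.89, 2322),
    ("TELEPHONENUM", 0.99, 1.00, 5039),
    ("USERNAME", 0.98, 0.98, 7680),
    ("ZIPCODE", 0.94, 0.97, 3191),
]


def macro_avg(values):
    """Unweighted mean: every entity type counts equally."""
    return sum(values) / len(values)


def weighted_avg(values, weights):
    """Mean weighted by support: frequent entity types count more."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


precisions = [r[1] for r in ROWS]
recalls = [r[2] for r in ROWS]
supports = [r[3] for r in ROWS]

total_support = sum(supports)
macro_p, macro_r = macro_avg(precisions), macro_avg(recalls)
weighted_p = weighted_avg(precisions, supports)
weighted_r = weighted_avg(recalls, supports)

print(total_support, f"{macro_p:.2f} {macro_r:.2f} {weighted_p:.2f} {weighted_r:.2f}")
# 79706 0.94 0.93 0.93 0.93
```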
Intended uses & limitations
Piiranha can be used to assist with redacting PII from texts. Use at your own risk. We do not accept responsibility for any incorrect model predictions.
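A redaction step might look like the sketch below. It is an assumption-laden illustration rather than an official usage recipe: `detect_entities` and `redact` are hypothetical helpers, `model_id` is a placeholder for the model's Hub id (not given here), and the span format (`entity_group`, `start`, `end`) is the one returned by the Transformers token-classification pipeline with `aggregation_strategy="simple"`.

```python
from typing import Dict, List


def detect_entities(text: str, model_id: str) -> List[Dict]:
    """Run a token-classification pipeline and return grouped entity spans.

    model_id is a placeholder; pass the Hub id of the model you want to use.
    Transformers is imported lazily so that redact() works without it.
    """
    from transformers import pipeline

    tagger = pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge adjacent tokens of the same type
    )
    return tagger(text)


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [TYPE] placeholder.

    Assumes spans do not overlap. Spans are processed right to left so that
    earlier character offsets stay valid after each replacement.
    """
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"] :]
    return text


# Hand-written spans standing in for model output.
sample = "Contact Jane at jane@example.com"
spans = [
    {"entity_group": "GIVENNAME", "start": 8, "end": 12},
    {"entity_group": "EMAIL", "start": 16, "end": 32},
]
print(redact(sample, spans))  # Contact [GIVENNAME] at [EMAIL]
```

Because the model can miss PII or mislabel it, the redacted output should be reviewed before being treated as safe.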
Training and evaluation data
The model was trained on the ai4privacy/pii-masking-400k dataset.
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 128
- eval_batch_size: 128
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 5
- mixed_precision_training: Native AMP
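To make the scheduler settings concrete, the sketch below computes the learning rate under a linear schedule with warmup, using the `learning_rate` and `lr_scheduler_warmup_ratio` values listed above. The total step count of 12,000 is a hypothetical figure chosen for illustration, not a number from the training run, and the function is a plain reimplementation of the general technique rather than the training code itself.

```python
import math


def linear_schedule_lr(
    step: int,
    total_steps: int,
    base_lr: float = 5e-05,      # learning_rate from the list above
    warmup_ratio: float = 0.05,  # lr_scheduler_warmup_ratio from the list above
) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 12_000  # hypothetical; warmup then covers the first 600 steps

for s in (0, 300, 600, 6_300, 12_000):
    print(s, f"{linear_schedule_lr(s, TOTAL_STEPS):.2e}")
```

With these assumed numbers the rate peaks at 5e-05 at step 600 and falls to half of that by step 6,300.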
Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| 0.2984 | 0.0983 | 250 | 0.1005 | 0.5446 | 0.6111 | 0.5759 | 0.9702 |
| 0.0568 | 0.1965 | 500 | 0.0464 | 0.7895 | 0.8459 | 0.8167 | 0.9849 |
| 0.0441 | 0.2948 | 750 | 0.0400 | 0.8346 | 0.8669 | 0.8504 | 0.9869 |
| 0.0368 | 0.3931 | 1000 | 0.0320 | 0.8531 | 0.8784 | 0.8656 | 0.9891 |
| 0.0323 | 0.4914 | 1250 | 0.0293 | 0.8779 | 0.8889 | 0.8834 | 0.9903 |
| 0.0287 | 0.5896 | 1500 | 0.0269 | 0.8919 | 0.8836 | 0.8877 | 0.9907 |
| 0.0282 | 0.6879 | 1750 | 0.0276 | 0.8724 | 0.9012 | 0.8866 | 0.9903 |
| 0.0268 | 0.7862 | 2000 | 0.0254 | 0.8890 | 0.9041 | 0.8965 | 0.9914 |
| 0.0264 | 0.8844 | 2250 | 0.0236 | 0.8886 | 0.9040 | 0.8962 | 0.9915 |
| 0.0243 | 0.9827 | 2500 | 0.0232 | 0.8998 | 0.9033 | 0.9015 | 0.9917 |
| 0.0213 | 1.0810 | 2750 | 0.0237 | 0.9115 | 0.9040 | 0.9077 | 0.9923 |
| 0.0213 | 1.1792 | 3000 | 0.0222 | 0.9123 | 0.9143 | 0.9133 | 0.9925 |
| 0.0217 | 1.2775 | 3250 | 0.0222 | 0.8999 | 0.9169 | 0.9083 | 0.9924 |
| 0.0209 | 1.3758 | 3500 | 0.0212 | 0.9111 | 0.9133 | 0.9122 | 0.9928 |
| 0.0204 | 1.4741 | 3750 | 0.0206 | 0.9054 | 0.9203 | 0.9128 | 0.9926 |
| 0.0183 | 1.5723 | 4000 | 0.0212 | 0.9126 | 0.9160 | 0.9143 | 0.9927 |
| 0.0191 | 1.6706 | 4250 | 0.0192 | 0.9122 | 0.9192 | 0.9157 | 0.9929 |
| 0.0185 | 1.7689 | 4500 | 0.0195 | 0.9200 | 0.9191 | 0.9196 | 0.9932 |
| 0.018 | 1.8671 | 4750 | 0.0188 | 0.9136 | 0.9215 | 0.9176 | 0.9933 |
| 0.0183 | 1.9654 | 5000 | 0.0191 | 0.9179 | 0.9212 | 0.9196 | 0.9934 |
| 0.0147 | 2.0637 | 5250 | 0.0188 | 0.9246 | 0.9242 | 0.9244 | 0.9937 |
| 0.0149 | 2.1619 | 5500 | 0.0184 | 0.9188 | 0.9254 | 0.9221 | 0.9937 |
| 0.0143 | 2.2602 | 5750 | 0.0193 | 0.9187 | 0.9224 | 0.9205 | 0.9932 |
| 0.014 | 2.3585 | 6000 | 0.0190 | 0.9246 | 0.9280 | 0.9263 | 0.9936 |
| 0.0146 | 2.4568 | 6250 | 0.0190 | 0.9225 | 0.9277 | 0.9251 | 0.9936 |
| 0.0148 | 2.5550 | 6500 | 0.0175 | 0.9297 | 0.9306 | 0.9301 | 0.9942 |
| 0.0136 | 2.6533 | 6750 | 0.0172 | 0.9191 | 0.9329 | 0.9259 | 0.9938 |
| 0.0137 | 2.7516 | 7000 | 0.0166 | 0.9299 | 0.9312 | 0.9306 | 0.9942 |
| 0.014 | 2.8498 | 7250 | 0.0167 | 0.9285 | 0.9313 | 0.9299 | 0.9942 |
| 0.0128 | 2.9481 | 7500 | 0.0166 | 0.9271 | 0.9326 | 0.9298 | 0.9943 |
| 0.0113 | 3.0464 | 7750 | 0.0171 | 0.9286 | 0.9347 | 0.9316 | 0.9946 |
| 0.0103 | 3.1447 | 8000 | 0.0172 | 0.9284 | 0.9383 | 0.9334 | 0.9945 |
| 0.0104 | 3.2429 | 8250 | 0.0169 | 0.9312 | 0.9406 | 0.9359 | 0.9947 |
| 0.0094 | 3.3412 | 8500 | 0.0166 | 0.9368 | 0.9359 | 0.9364 | 0.9948 |
| 0.01 | 3.4395 | 8750 | 0.0166 | 0.9289 | 0.9387 | 0.9337 | 0.9944 |
| 0.0099 | 3.5377 | 9000 | 0.0162 | 0.9335 | 0.9332 | 0.9334 | 0.9947 |
| 0.0099 | 3.6360 | 9250 | 0.0160 | 0.9321 | 0.9380 | 0.9350 | 0.9947 |
| 0.01 | 3.7343 | 9500 | 0.0168 | 0.9306 | 0.9389 | 0.9347 | 0.9947 |
| 0.0101 | 3.8325 | 9750 | 0.0159 | 0.9339 | 0.9350 | 0.9344 | 0.9947 |
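The Epoch and Step columns allow a rough estimate of the number of optimizer steps per epoch and, combined with `train_batch_size`, of the number of training examples seen per epoch. This is back-of-the-envelope arithmetic only: it assumes each logged step is one optimizer step on a single batch of 128, with no gradient accumulation or multi-device scaling, none of which is confirmed here.

```python
TRAIN_BATCH_SIZE = 128  # train_batch_size from the hyperparameter list

# (step, epoch) pairs taken from the first and last rows of the table above
first_row = (250, 0.0983)
last_row = (9750, 3.8325)


def steps_per_epoch(step: int, epoch: float) -> float:
    """Estimate optimizer steps per epoch from one logged (step, epoch) pair."""
    return step / epoch


estimates = [steps_per_epoch(*row) for row in (first_row, last_row)]
avg_steps = sum(estimates) / len(estimates)
approx_examples = avg_steps * TRAIN_BATCH_SIZE  # valid only under the assumptions above

print(f"about {avg_steps:.0f} steps per epoch")
print(f"about {approx_examples:,.0f} examples per epoch")
```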
Contact
william (at) integrinet [dot] org
Framework versions
- Transformers 4.44.2
- Pytorch 2.4.1+cu121
- Datasets 3.0.0
- Tokenizers 0.19.1
Piiranha was trained on H100 GPUs generously sponsored by the Akash Network
| Property | Details |
|---|---|
| Model Type | Piiranha-v1 |
| Training Data | ai4privacy/pii-masking-400k |
| License | cc-by-nc-nd-4.0 |
| Base Model | microsoft/mdeberta-v3-base |
| Pipeline Tag | token-classification |
| Supported Languages | English, Spanish, French, German, Italian, Dutch |
| Supported PII Types | Account Number, Building Number, City, Credit Card Number, Date of Birth, Driver's License, Email, First Name, Last Name, ID Card, Password, Social Security Number, Street Address, Tax Number, Phone Number, Username, Zipcode |