🚀 StarPII
This is an NER model designed to detect Personally Identifiable Information (PII) in code datasets. It fine-tunes a pre-trained encoder to accurately identify various types of PII, offering a reliable solution for data privacy in code.
🚀 Quick Start
Before using the model, please read and agree to the Terms of Use. You can access the necessary datasets and start the PII detection process as described in the model description.
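Once access is granted, the model can be loaded with the standard `transformers` token-classification pipeline. The snippet below is a minimal sketch: the checkpoint name `bigcode/starpii` and the `aggregation_strategy` value are assumptions, and because the model is gated, the masking step is demonstrated with mock pipeline output rather than a live call.

```python
from typing import Dict, List

# With access granted, the real pipeline would be built roughly like this:
# from transformers import pipeline
# pii_detector = pipeline("token-classification",
#                         model="bigcode/starpii",        # assumed checkpoint name
#                         aggregation_strategy="simple")  # merge sub-tokens into entity spans
# entities = pii_detector(code_snippet)

def mask_pii(text: str, entities: List[Dict]) -> str:
    """Replace each detected PII span with a <TYPE> placeholder.

    `entities` follows the transformers token-classification pipeline output
    format: dicts with "entity_group", "start", and "end" keys.
    """
    # Process spans right-to-left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"<{ent['entity_group']}>" + text[ent["end"]:]
    return text

# Mock pipeline output for illustration only
code_snippet = 'EMAIL = "jane@example.com"  # contact'
mock_entities = [{"entity_group": "EMAIL", "start": 9, "end": 25}]
print(mask_pii(code_snippet, mock_entities))  # EMAIL = "<EMAIL>"  # contact
```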
✨ Features
- Fine-tuned on a custom dataset: the model is fine-tuned on a self-annotated PII dataset, enhancing its accuracy in detecting PII in code.
- Multiple target classes: capable of detecting 6 target classes: Names, Emails, Keys, Passwords, IP addresses, and Usernames.
- Pseudo-labeled training: initial training on a pseudo-labeled dataset improves performance on rare PII entities.
📚 Documentation
Model description
This is an NER model trained to detect Personally Identifiable Information (PII) in code datasets. We fine-tuned [bigcode-encoder](https://huggingface.co/bigcode/bigcode-encoder) on a PII dataset we annotated, available with gated access at [bigcode-pii-dataset](https://huggingface.co/datasets/bigcode/pii-annotated-toloka-donwsample-emails) (see [bigcode-pii-dataset-training](https://huggingface.co/datasets/bigcode/bigcode-pii-dataset-training) for the exact data splits).
We added a linear layer as a token classification head on top of the encoder model, with 6 target classes: Names, Emails, Keys, Passwords, IP addresses and Usernames.
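The token classification head described above can be sketched in PyTorch. This is an illustrative sketch only: the hidden size of 768 is an assumption, and the real head operates on the bigcode-encoder's token embeddings.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 6  # Names, Emails, Keys, Passwords, IP addresses, Usernames
HIDDEN = 768     # illustrative encoder hidden size, not necessarily the real one

class TokenClassificationHead(nn.Module):
    """Linear layer mapping each encoder token embedding to a PII label."""
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden) -> (batch, seq_len, num_labels)
        return self.classifier(token_embeddings)

head = TokenClassificationHead(HIDDEN, NUM_CLASSES)
logits = head(torch.randn(1, 16, HIDDEN))
print(logits.shape)  # torch.Size([1, 16, 6])
```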
Dataset
Fine-tuning on the annotated dataset
The fine-tuning dataset contains 20,961 secrets across 31 programming languages; the base encoder model was pre-trained on 88 programming languages from the [The Stack](https://huggingface.co/datasets/bigcode/the-stack) dataset.
Initial training on a pseudo-labeled dataset
To enhance model performance on some rare PII entities like keys, we initially trained on a pseudo-labeled dataset before fine-tuning on the annotated dataset.
The method involves training a model on a small set of labeled data and subsequently generating predictions for a larger set of unlabeled data.
Specifically, we annotated 18,000 files, available at [bigcode-pii-pseudo-labeled](https://huggingface.co/datasets/bigcode/pseudo-labeled-python-data-pii-detection-filtered), using an ensemble of two encoder models, [Deberta-v3-large](https://huggingface.co/microsoft/deberta-v3-large) and [stanford-deidentifier-base](https://huggingface.co/StanfordAIMI/stanford-deidentifier-base), which were fine-tuned on an internal, previously labeled PII [dataset](https://huggingface.co/datasets/bigcode/pii-for-code) for code containing 400 files from this work.
To select good-quality pseudo-labels, we averaged the two models' predicted probabilities and retained only predictions above a minimum score.
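The selection step can be sketched as below. The `min_score` threshold of 0.9 and the per-class averaging are illustrative assumptions, not the values used in the original work.

```python
def select_pseudo_labels(probs_a, probs_b, min_score=0.9):
    """Average per-token class probabilities from two models and keep only
    tokens whose ensembled confidence clears `min_score`.

    probs_a, probs_b: per-token lists of class probabilities from each model.
    Returns one label per token, or None if the token is filtered out.
    """
    selected = []
    for pa, pb in zip(probs_a, probs_b):
        avg = [(a + b) / 2 for a, b in zip(pa, pb)]       # ensemble average per class
        best = max(range(len(avg)), key=avg.__getitem__)  # most likely class
        # Keep the pseudo-label only if the averaged confidence is high enough.
        selected.append(best if avg[best] >= min_score else None)
    return selected

# Two tokens, two classes: the first is kept, the second is filtered out.
probs_model_a = [[0.95, 0.05], [0.60, 0.40]]
probs_model_b = [[0.93, 0.07], [0.70, 0.30]]
print(select_pseudo_labels(probs_model_a, probs_model_b))  # [0, None]
```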
After inspection, we observed a high rate of false positives for Keys and Passwords, so we retained only entities with a trigger word such as `key`, `auth`, or `pwd` in the surrounding context.
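This context filter can be sketched as follows. The context window size and the exact trigger list are illustrative assumptions.

```python
TRIGGER_WORDS = ("key", "auth", "pwd")  # illustrative, not necessarily the full list

def has_trigger(text: str, start: int, end: int, window: int = 40) -> bool:
    """Keep a detected Key/Password span only if a trigger word appears
    within `window` characters around it."""
    context = text[max(0, start - window):end + window].lower()
    return any(word in context for word in TRIGGER_WORDS)

src = 'api_key = "a8f5f167f44f4964e6c998dee827110c"'
print(has_trigger(src, 11, 43))            # True: "key" appears nearby
print(has_trigger("x = 'a8f5...'", 5, 9))  # False: no trigger word in context
```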
Training on this synthetic dataset prior to fine-tuning on the annotated one yielded superior results for all PII categories, as demonstrated in the table in the following section.
Performance
This model is represented in the last row of each table (NER + pseudo labels).

| Property | Details |
|----------|---------|
| Model Type | NER model for PII detection in code datasets |
| Training Data | Annotated PII dataset and pseudo-labeled dataset |

- Emails, IP addresses and Keys

| Method | Email Prec. | Email Recall | Email F1 | IP Prec. | IP Recall | IP F1 | Key Prec. | Key Recall | Key F1 |
|--------|-------------|--------------|----------|----------|-----------|-------|-----------|------------|--------|
| Regex | 69.8% | 98.8% | 81.8% | 65.9% | 78% | 71.7% | 2.8% | 46.9% | 5.3% |
| NER | 94.01% | 98.10% | 96.01% | 88.95% | 94.43% | 91.61% | 60.37% | 53.38% | 56.66% |
| + pseudo labels | 97.73% | 98.94% | 98.15% | 90.10% | 93.86% | 91.94% | 62.38% | 80.81% | 70.41% |
- Names, Usernames and Passwords

| Method | Name Prec. | Name Recall | Name F1 | Username Prec. | Username Recall | Username F1 | Password Prec. | Password Recall | Password F1 |
|--------|------------|-------------|---------|----------------|-----------------|-------------|----------------|-----------------|-------------|
| NER | 83.66% | 95.52% | 89.19% | 48.93% | 75.55% | 59.39% | 59.16% | 96.62% | 73.39% |
| + pseudo labels | 86.45% | 97.38% | 91.59% | 52.20% | 74.81% | 61.49% | 70.94% | 95.96% | 81.57% |
We used this model to mask PII in the bigcode large model training. We dropped usernames since they resulted in many false positives and negatives.
For the other PII types, we added the following post-processing steps, which we recommend for future uses of the model (the code is also available on GitHub):
- Ignore secrets with fewer than 4 characters.
- Detect full names only.
- Ignore detected keys with fewer than 9 characters or that are not gibberish, using a [gibberish-detector](https://github.com/domanchi/gibberish-detector).
- Ignore IP addresses that aren't valid or are private (non-internet-facing), using the `ipaddress` Python package. We also ignore IP addresses from popular DNS servers, using the same list as in this paper.
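Two of these rules can be sketched with the standard-library `ipaddress` module; the gibberish check and the DNS-server list are omitted here, and the function names are illustrative.

```python
import ipaddress

def keep_secret(secret: str) -> bool:
    """Rule: ignore secrets with fewer than 4 characters."""
    return len(secret) >= 4

def keep_ip(candidate: str) -> bool:
    """Rule: keep only syntactically valid, globally routable IP addresses."""
    try:
        ip = ipaddress.ip_address(candidate)
    except ValueError:
        return False      # not a valid IPv4/IPv6 address
    return ip.is_global   # drops private, loopback, and link-local ranges

print(keep_ip("10.0.0.1"))   # False: private range
print(keep_ip("8.8.8.8"))    # True: public (popular DNS IPs are additionally dropped in practice)
print(keep_ip("999.1.1.1"))  # False: not a valid address
```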
Considerations for Using the Model
⚠️ Important Note
While using this model, please be aware that there may be potential risks associated with its application. There is a possibility of false positives and negatives, which could lead to unintended consequences when processing sensitive data. Moreover, the model's performance may vary across different data types and programming languages, necessitating validation and fine-tuning for specific use cases.
💡 Usage Tip
Researchers and developers are expected to uphold ethical standards and data protection measures when using the model. By making it openly accessible, our aim is to encourage the development of privacy-preserving AI technologies while remaining vigilant of potential risks associated with PII.
Terms of Use for the model
This is an NER model trained to detect Personally Identifiable Information (PII) in code datasets. We ask that you read and agree to the following Terms of Use before using the model:
- You agree that you will not use the model for any purpose other than detecting PII in order to remove it from datasets.
- You agree that you will not share the model or any modified versions for whatever purpose.
- Unless required by applicable law or agreed to in writing, the model is provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using the model, and assume any risks associated with your exercise of permissions under these Terms of Use.
- IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE MODEL OR THE USE OR OTHER DEALINGS IN THE MODEL.
Additional Access Information
| Field | Type |
|-------|------|
| Email | text |
| I have read the License and agree with its terms | checkbox |