StarPII

Developed by bigcode
An NER model that detects personally identifiable information (PII) in code datasets. It identifies six types of PII: names, emails, keys, passwords, IP addresses, and usernames.
Downloads 2,484
Release Time: 4/23/2023

Model Overview

This model is fine-tuned from bigcode-encoder and is designed to identify and remove personally identifiable information (PII) from code data across multiple programming languages.

Model Features

Pseudo-Label Enhanced Training
The model is first trained on a pseudo-labeled dataset and then fine-tuned on annotated data, which significantly improves recognition of rare PII entities such as keys.
Multi-Category PII Detection
Identifies six types of PII: names, emails, keys, passwords, IP addresses, and usernames.
Intelligent Post-Processing
Applies post-processing rules, such as ignoring short keys, incomplete names, and invalid IP addresses, to reduce false positives.
Multi-Programming Language Support
Built on an encoder pre-trained on 88 programming languages and fine-tuned on PII data from 31 languages.
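
The post-processing rules described above (ignoring short keys, incomplete names, and invalid IPs) could be sketched roughly as follows. This is an illustration only, not the model's actual implementation: the `Entity` type, the label strings, the `MIN_KEY_LENGTH` threshold, and the "at least two words" test for a complete name are all assumptions made for the example.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Entity:
    """A detected span (hypothetical structure for this sketch)."""
    label: str  # assumed label names: "KEY", "NAME", "IP_ADDRESS", ...
    text: str


# Assumed threshold; the page only says that "short" keys are ignored.
MIN_KEY_LENGTH = 9


def is_valid_ip(text: str) -> bool:
    """Return True if text parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


def keep_entity(entity: Entity) -> bool:
    """Decide whether a detection survives post-processing."""
    text = entity.text.strip()
    if entity.label == "KEY":
        # Short keys are likely false positives.
        return len(text) >= MIN_KEY_LENGTH
    if entity.label == "NAME":
        # Assumption: a "complete" name has at least two words.
        return len(text.split()) >= 2
    if entity.label == "IP_ADDRESS":
        return is_valid_ip(text)
    # Other categories pass through unchanged in this sketch.
    return True


def postprocess(entities: list[Entity]) -> list[Entity]:
    """Filter out detections that fail the rules above."""
    return [e for e in entities if keep_entity(e)]
```

Filtering after detection trades a small amount of recall for fewer false positives, which matters when the detections are later used to rewrite source code.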

Model Capabilities

PII detection in code
Multi-category entity recognition
Cross-language PII recognition
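
As a hedged sketch of how these capabilities might be used, the snippet below runs a token-classification pipeline from the Hugging Face `transformers` library over a piece of code. It assumes the model is available on the Hugging Face Hub under the ID `bigcode/starpii` and that access has been granted; the helper names `load_detector` and `detect_pii` are hypothetical and not part of any library.

```python
def load_detector(model_id: str = "bigcode/starpii"):
    """Build a token-classification pipeline (model ID is an assumption)."""
    # Imported lazily so the rest of the module works without transformers.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge sub-word tokens into spans
    )


def detect_pii(detector, code: str) -> list[dict]:
    """Run the detector and return labeled character spans."""
    return [
        {
            "label": result["entity_group"],
            "start": result["start"],
            "end": result["end"],
            "text": code[result["start"]:result["end"]],
        }
        for result in detector(code)
    ]
```

A typical call would be `detect_pii(load_detector(), source_text)`. Because `detect_pii` accepts any callable that returns pipeline-style dictionaries, it can be tried with a stub before downloading the model.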

Use Cases

Data Privacy Protection
Code Repository PII Cleaning
Cleaning sensitive information from code repositories before training AI models
Effectively identifies and removes PII from code, reducing data leakage risks
Open-Source Project Auditing
Checking open-source code for sensitive information
Helps developers discover and remove accidentally submitted PII
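
Once PII spans have been detected, removing them from a file can be as simple as replacing each span with a placeholder. The following is a minimal sketch under stated assumptions: spans are `(start, end, label)` character offsets that do not overlap, and the `<LABEL>` placeholder format is chosen here for illustration rather than taken from the model.

```python
def redact(code: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) span with a <LABEL> placeholder.

    Assumes spans do not overlap. Spans are applied from right to left so
    that earlier offsets stay valid after each replacement.
    """
    redacted = code
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        redacted = redacted[:start] + f"<{label}>" + redacted[end:]
    return redacted
```

Working from the end of the text backwards avoids recomputing offsets after each substitution changes the string length.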