Gretel GLiNER: Fine-Tuned Models for PII/PHI Detection
Gretel GLiNER is a fine-tuned model for detecting Personally Identifiable Information (PII) and Protected Health Information (PHI), offering privacy-compliant entity recognition across various industries.
Quick Start
Gretel GLiNER is a fine-tuned version of the GLiNER base model knowledgator/gliner-bi-large-v1.0. It is designed specifically to detect PII and PHI, providing privacy-compliant entity recognition across industries and document types. For more details about the base GLiNER model, refer to the GLiNER Model Card.
The model was fine-tuned on the gretelai/gretel-pii-masking-en-v1 dataset, which contains a rich and diverse set of synthetic document snippets with PII and PHI entities. Training used the dataset's training split; the validation split was used to monitor performance and adjust parameters; and final performance was evaluated on the test split, with the PII/PHI entity annotations serving as ground truth. For detailed dataset statistics, visit the dataset documentation on Hugging Face.
Features
- High-performance Detection: All fine-tuned Gretel GLiNER models show significant improvements in accuracy, precision, recall, and F1 score compared to their base counterparts.
- Diverse Use Cases: Suited to healthcare, finance, cybersecurity, and legal work, and to data privacy compliance more broadly.
Installation
Ensure Python is installed. Then install or update the gliner package:
pip install gliner -U
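After installing, it can be handy to confirm which version of the package is active in the current environment. The following is a small sketch using only the Python standard library; the helper name `installed_version` is illustrative and not part of gliner:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string, or a short notice if the package is missing.
print(installed_version("gliner") or "gliner is not installed")
```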
Usage Examples
Basic Usage
from gliner import GLiNER
model = GLiNER.from_pretrained("gretelai/gretel-gliner-bi-large-v1.0")
text = """
Purchase Order
----------------
Date: 10/05/2023
----------------
Customer Name: CID-982305
Billing Address: 1234 Oak Street, Suite 400, Springfield, IL, 62704
Phone: (312) 555-7890 (555-876-5432)
Email: janedoe@company.com
"""
# Entity types to look for; trim this list to the types relevant to your data.
labels = [
    "medical_record_number",
    "date_of_birth",
    "ssn",
    "date",
    "first_name",
    "email",
    "last_name",
    "customer_id",
    "employee_id",
    "name",
    "street_address",
    "phone_number",
    "ipv4",
    "credit_card_number",
    "license_plate",
    "address",
    "user_name",
    "device_identifier",
    "bank_routing_number",
    "date_time",
    "company_name",
    "unique_identifier",
    "biometric_identifier",
    "account_number",
    "city",
    "certificate_license_number",
    "time",
    "postcode",
    "vehicle_identifier",
    "coordinate",
    "country",
    "api_key",
    "ipv6",
    "password",
    "health_plan_beneficiary_number",
    "national_id",
    "tax_id",
    "url",
    "state",
    "swift_bic",
    "cvv",
    "pin",
]
# Keep only predictions whose confidence score is at least 0.7.
entities = model.predict_entities(text, labels, threshold=0.7)
for entity in entities:
    print(f"{entity['text']} => {entity['label']}")
Expected Output:
CID-982305 => customer_id
1234 Oak Street, Suite 400 => street_address
Springfield => city
IL => state
62704 => postcode
(312) 555-7890 => phone_number
555-876-5432 => phone_number
janedoe@company.com => email
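The entities returned by `predict_entities` are plain dictionaries, so they can be post-processed with ordinary Python. Below is a minimal sketch that groups detected values by label and applies a stricter score cutoff after the fact. It assumes each dictionary carries `text`, `label`, and `score` keys; the helper name `group_by_label` is illustrative, and the `sample` list with its scores is invented for the example rather than taken from model output.

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.0):
    """Group entity texts by label, skipping entities scored below min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        # Entities without a score are kept (treated as fully confident).
        if entity.get("score", 1.0) >= min_score:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


# Hypothetical entities shaped like predict_entities output; scores are made up.
sample = [
    {"text": "CID-982305", "label": "customer_id", "score": 0.97},
    {"text": "(312) 555-7890", "label": "phone_number", "score": 0.97},
    {"text": "555-876-5432", "label": "phone_number", "score": 0.91},
    {"text": "Springfield", "label": "city", "score": 0.62},
]

print(group_by_label(sample, min_score=0.7))
```

Grouping this way makes it easy to review everything detected for one entity type, for example all phone numbers in a document, before deciding how to handle them.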
Documentation
Model Performance
All fine-tuned Gretel GLiNER models demonstrate substantial improvements over their base counterparts in accuracy, precision, recall, and F1 score:
| Model | Accuracy | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- |
| gretelai/gretel-gliner-bi-small-v1.0 | 0.89 | 0.98 | 0.91 | 0.94 |
| gretelai/gretel-gliner-bi-base-v1.0 | 0.91 | 0.98 | 0.92 | 0.95 |
| gretelai/gretel-gliner-bi-large-v1.0 | 0.91 | 0.99 | 0.93 | 0.95 |
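The F1 column can be sanity-checked against the precision and recall columns, since F1 is conventionally the harmonic mean of the two. The sketch below recomputes it from the rounded values in the table above. The function name `f1_score` is illustrative, and this card does not state how the metrics were averaged, so agreement only to within rounding should be expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the performance table.
reported = {
    "gretelai/gretel-gliner-bi-small-v1.0": (0.98, 0.91, 0.94),
    "gretelai/gretel-gliner-bi-base-v1.0": (0.98, 0.92, 0.95),
    "gretelai/gretel-gliner-bi-large-v1.0": (0.99, 0.93, 0.95),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed {f1_score(p, r):.3f}, reported {f1:.2f}")
```

Because the inputs are already rounded to two decimals, a recomputed value can differ from the reported one in the last digit.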
Use Cases
Gretel GLiNER is suitable for applications that require the detection and redaction of sensitive information:
- Healthcare: Automate the extraction and redaction of patient information from medical records.
- Finance: Identify and secure financial data such as account numbers and transaction details.
- Cybersecurity: Detect sensitive information in logs and security reports.
- Legal: Process contracts and legal documents to protect client information.
- Data Privacy Compliance: Ensure data handling processes comply with regulations like GDPR and HIPAA by accurately identifying PII/PHI.
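Redaction, mentioned in several of the use cases above, can be built on top of the detected spans. The following is a minimal sketch, not Gretel's own redaction method. It assumes each entity dictionary has `start` and `end` character offsets (with `end` exclusive) and a `label`, and that spans do not overlap. The names `redact` and `make_entity` are hypothetical helpers, and the entities are constructed by hand rather than produced by the model.

```python
def redact(text, entities):
    """Replace each entity span in text with a [LABEL] placeholder."""
    # Work from right to left so earlier offsets stay valid after each replacement.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['label'].upper()}]"
        text = text[: entity["start"]] + placeholder + text[entity["end"]:]
    return text


def make_entity(text, value, label):
    """Hypothetical helper: build an entity dict by locating value in text."""
    start = text.index(value)
    return {"start": start, "end": start + len(value), "label": label}


line = "Phone: 555-876-5432, Email: janedoe@company.com"
entities = [
    make_entity(line, "555-876-5432", "phone_number"),
    make_entity(line, "janedoe@company.com", "email"),
]

print(redact(line, entities))
```

Replacing from the last span to the first avoids having to recompute offsets as the string changes length. With real model output, the list returned by `predict_entities` would take the place of the hand-built `entities`.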
License
This project is licensed under the Apache-2.0 license.
Citation
If you use this dataset in your research or applications, please cite it as:
@dataset{gretel-pii-masking-en-v1,
author = {Gretel AI},
title = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents},
year = {2024},
month = {10},
publisher = {Gretel},
howpublished = {https://huggingface.co/gretelai/gretel-pii-masking-en-v1}
}
For questions, issues, or additional information, please visit our Synthetic Data Discord community or reach out to gretel.ai.