Gretel GLiNER: Fine-Tuned Models for PII/PHI Detection
The Gretel GLiNER model is a fine-tuned version of the GLiNER base model knowledgator/gliner-bi-small-v1.0. It is specifically trained to detect Personally Identifiable Information (PII) and Protected Health Information (PHI). Gretel GLiNER offers privacy-compliant entity recognition across industries and document types.
For more details about the base GLiNER model, including its architecture and general capabilities, refer to the GLiNER Model Card.
The model was fine-tuned on the gretelai/gretel-pii-masking-en-v1 dataset, which contains a rich and diverse collection of synthetic document snippets with PII and PHI entities.
- Training: The training split of the synthetic dataset was used.
- Validation: The validation set was used to monitor performance and adjust training parameters.
- Evaluation: The final performance was assessed on the test set, using PII/PHI entity annotations as the ground truth.
For detailed dataset statistics, including domain and entity type distributions, visit the dataset documentation on Hugging Face.
Features
- Enhanced Performance: All fine-tuned Gretel GLiNER models show significant improvements over their base counterparts in accuracy, precision, recall, and F1 score.
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| gretelai/gretel-gliner-bi-small-v1.0 | 0.89 | 0.98 | 0.91 | 0.94 |
| gretelai/gretel-gliner-bi-base-v1.0 | 0.91 | 0.98 | 0.92 | 0.95 |
| gretelai/gretel-gliner-bi-large-v1.0 | 0.91 | 0.99 | 0.93 | 0.95 |
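As a quick sanity check on how the columns relate, the sketch below recomputes F1 from the reported precision and recall. It assumes the reported F1 is the harmonic mean of the reported precision and recall; the source does not state the averaging method. The function name `f1_score` is illustrative. Because the table values are rounded to two decimals, a recomputed value can differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the table above.
reported = {
    "gretelai/gretel-gliner-bi-small-v1.0": (0.98, 0.91, 0.94),
    "gretelai/gretel-gliner-bi-base-v1.0": (0.98, 0.92, 0.95),
    "gretelai/gretel-gliner-bi-large-v1.0": (0.99, 0.93, 0.95),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{name}: recomputed F1 = {recomputed:.3f}, reported = {f1:.2f}")
```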
Installation
Ensure Python is installed, then install or update the gliner package:
pip install gliner -U
Usage Examples
Basic Usage
from gliner import GLiNER
model = GLiNER.from_pretrained("gretelai/gretel-gliner-bi-small-v1.0")
text = """
Purchase Order
----------------
Date: 10/05/2023
----------------
Customer Name: CID-982305
Billing Address: 1234 Oak Street, Suite 400, Springfield, IL, 62704
Phone: (312) 555-7890 (555-876-5432)
Email: janedoe@company.com
"""
labels = [
    "medical_record_number",
    "date_of_birth",
    "ssn",
    "date",
    "first_name",
    "email",
    "last_name",
    "customer_id",
    "employee_id",
    "name",
    "street_address",
    "phone_number",
    "ipv4",
    "credit_card_number",
    "license_plate",
    "address",
    "user_name",
    "device_identifier",
    "bank_routing_number",
    "date_time",
    "company_name",
    "unique_identifier",
    "biometric_identifier",
    "account_number",
    "city",
    "certificate_license_number",
    "time",
    "postcode",
    "vehicle_identifier",
    "coordinate",
    "country",
    "api_key",
    "ipv6",
    "password",
    "health_plan_beneficiary_number",
    "national_id",
    "tax_id",
    "url",
    "state",
    "swift_bic",
    "cvv",
    "pin",
]
entities = model.predict_entities(text, labels, threshold=0.7)
# Print each detected span together with its predicted label.
for entity in entities:
    print(f"{entity['text']} => {entity['label']}")
Expected Output:
CID-982305 => customer_id
1234 Oak Street, Suite 400 => street_address
Springfield => city
IL => state
62704 => postcode
(312) 555-7890 => phone_number
555-876-5432 => phone_number
janedoe@company.com => email
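Each item returned by `predict_entities` is a dictionary. Besides `text` and `label`, the gliner package also includes `start` and `end` character offsets and a `score`. The sketch below groups detections by label and applies a stricter score cut-off after prediction. The helper `group_by_label` and the sample list are hypothetical: the sample is hand-written to mirror the output shape so the snippet runs without downloading the model, and its scores are made up for illustration.

```python
from collections import defaultdict

def group_by_label(entities, min_score=0.0):
    """Group detected entity texts by label, dropping low-score detections."""
    grouped = defaultdict(list)
    for entity in entities:
        # Entities without a score are kept (treated as fully confident).
        if entity.get("score", 1.0) >= min_score:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)

# Hand-written sample in the shape of predict_entities output (scores are illustrative).
sample_entities = [
    {"text": "CID-982305", "label": "customer_id", "score": 0.95},
    {"text": "(312) 555-7890", "label": "phone_number", "score": 0.91},
    {"text": "555-876-5432", "label": "phone_number", "score": 0.74},
    {"text": "janedoe@company.com", "label": "email", "score": 0.99},
]

grouped = group_by_label(sample_entities, min_score=0.8)
print(grouped)
```

Filtering after prediction lets one model call serve several confidence levels, which can be cheaper than calling the model again with a different `threshold`.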
Documentation
Gretel GLiNER is suitable for applications that need to detect and redact sensitive information:
- Healthcare: Automate the extraction and redaction of patient information from medical records.
- Finance: Identify and secure financial data such as account numbers and transaction details.
- Cybersecurity: Detect sensitive information in logs and security reports.
- Legal: Process contracts and legal documents to protect client information.
- Data Privacy Compliance: Ensure data-handling processes comply with regulations like GDPR and HIPAA by accurately identifying PII/PHI.
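For the redaction use cases above, one possible approach is to replace each detected span with a placeholder built from its label. This is a minimal sketch, not the authors' method: the function `redact`, the placeholder format, and the sample data are hypothetical. It assumes entities carry `start` and `end` character offsets, as in the dictionaries returned by `predict_entities`, and that spans do not overlap.

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder."""
    # Work from the end of the text so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['label'].upper()}]"
        text = text[:entity["start"]] + placeholder + text[entity["end"]:]
    return text

# Hand-written sample in the shape of predict_entities output (illustrative only).
sample_text = "Email: janedoe@company.com, Phone: 555-876-5432"
sample_entities = [
    {"start": 7, "end": 26, "label": "email"},
    {"start": 35, "end": 47, "label": "phone_number"},
]

redacted = redact(sample_text, sample_entities)
print(redacted)
```

With a loaded model, the same helper could be called as `redact(text, model.predict_entities(text, labels, threshold=0.7))`.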
License
This project is licensed under the Apache-2.0 license.
Citation
If you use this dataset in your research or applications, please cite it as:
@dataset{gretel-pii-masking-en-v1,
  author = {Gretel AI},
  title = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents},
  year = {2024},
  month = {10},
  publisher = {Gretel},
  howpublished = {https://huggingface.co/gretelai/gretel-pii-masking-en-v1}
}
For questions, issues, or additional information, visit our Synthetic Data Discord community or reach out to gretel.ai.