# ClinicalBERT for Medical Note De-identification
This is a fine-tuned ClinicalBERT model designed for the de-identification of medical notes, aiming to protect patients' private health information.
## Features
- Fine-tuned ClinicalBERT: Based on the ClinicalBERT model [Alsentzer et al., 2019], fine-tuned for the de-identification of medical notes.
- Sequence Labeling: Trained to predict protected health information (PHI/PII) entities (spans). It classifies each token as non-PHI or as one of the 11 PHI types and aggregates token predictions into spans using BILOU tagging.
- HIPAA Compliance: The protected health information categories follow the regulations of [HIPAA](https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html).
- Detailed Documentation: All details about training, the PHI labels, and usage can be found in the GitHub repo [Robust DeID](https://github.com/obi-ml-public/ehr_deidentification).
## Installation
Specific installation steps are not provided in this model card; see the GitHub repo [Robust DeID](https://github.com/obi-ml-public/ehr_deidentification) for setup and usage details.
## Usage Examples
### Basic Usage
A demo of how the model works (using model predictions to de-identify a medical note) is available on this Space: [Medical-Note-Deidentification](https://huggingface.co/spaces/obi/Medical-Note-Deidentification).
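For programmatic use, the sketch below loads the model with the Hugging Face transformers token-classification pipeline and prints token-level PHI predictions. This is an illustrative assumption, not the project's official interface: the model identifier shown is a placeholder for this model's actual Hub path, and the repo's forward-pass scripts should be preferred for proper sentence handling and span aggregation.

```python
# Hedged sketch: token-level PHI predictions via the transformers pipeline.
# "obi/deid_bert_i2b2" is an assumed identifier -- replace it with this model's
# actual Hub path.
from transformers import pipeline

deid_tagger = pipeline(
    "token-classification",
    model="obi/deid_bert_i2b2",  # assumption: substitute the real model path
)

note = "Mr. Smith was seen at Memorial Hospital on 10/12/2020 by Dr. Jones."
for pred in deid_tagger(note):
    # Each prediction carries a BILOU-prefixed PHI label, a confidence score,
    # and character offsets into the input text.
    print(pred["entity"], pred["word"], pred["start"], pred["end"], round(pred["score"], 3))
```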
### Advanced Usage
Steps on how this model can be used to run a forward pass can be found here: [Forward Pass](https://github.com/obi-ml-public/ehr_deidentification/tree/master/steps/forward_pass). In brief:
- Sentencize and tokenize the dataset (the model aggregates the sentences back to the note level).
- Use the predict function of this model to gather the predictions (i.e., a prediction for each token).
- Additionally, the model predictions can be used to remove PHI from the original note/text, as sketched below.
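As an illustration of that last step, the sketch below shows one way character-offset span predictions could be used to redact a note. The span format and the `redact` helper are assumptions for illustration only; the Robust DeID repo defines its own data formats and utilities.

```python
# Hedged sketch: redacting a note given predicted PHI spans with character offsets.
def redact(text: str, spans: list[dict]) -> str:
    """Replace each predicted PHI span with a <<LABEL>> placeholder."""
    redacted = text
    # Process spans right-to-left so earlier character offsets stay valid.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"<<{span['label']}>>"
        redacted = redacted[: span["start"]] + placeholder + redacted[span["end"] :]
    return redacted

note = "Mr. Smith was seen at Memorial Hospital on 10/12/2020."
predicted_spans = [
    {"start": 4, "end": 9, "label": "PATIENT"},
    {"start": 22, "end": 39, "label": "HOSP"},
    {"start": 43, "end": 53, "label": "DATE"},
]
print(redact(note, predicted_spans))
# Mr. <<PATIENT>> was seen at <<HOSP>> on <<DATE>>.
```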
## Documentation
### Model Description
- A ClinicalBERT [Alsentzer et al., 2019] model fine-tuned for de-identification of medical notes.
- Sequence labeling (token classification): the model was trained to predict protected health information (PHI/PII) entities (spans). The list of protected health information categories is given by [HIPAA](https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html).
- A token can be classified either as non-PHI or as one of the 11 PHI types. Token predictions are aggregated into spans using BILOU tagging (a sketch of this aggregation follows this list).
- The PHI labels used for training and other details can be found here: [Annotation Guidelines](https://github.com/obi-ml-public/ehr_deidentification/blob/master/AnnotationGuidelines.md)
- More details on how to use this model, the expected data format, and other useful information are available in the GitHub repo: [Robust DeID](https://github.com/obi-ml-public/ehr_deidentification).
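To make the BILOU aggregation step concrete, here is a minimal illustrative re-implementation (not the repo's own aggregation code); the tag names and the (token, tag) input format are assumptions.

```python
# Hedged sketch: collecting BILOU token tags (B-, I-, L-, U-, O) into entity spans.
def bilou_to_spans(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Return (label, text) spans recovered from BILOU-tagged tokens."""
    spans, current_tokens, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("U-"):                       # single-token entity
            spans.append((tag[2:], token))
        elif tag.startswith("B-"):                     # beginning of an entity
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label:   # inside an entity
            current_tokens.append(token)
        elif tag.startswith("L-") and current_label:   # last token of an entity
            current_tokens.append(token)
            spans.append((current_label, " ".join(current_tokens)))
            current_tokens, current_label = [], None
        else:                                          # "O" / non-PHI token
            current_tokens, current_label = [], None
    return spans

tokens = ["Seen", "by", "Dr.", "Jane", "Doe", "on", "10/12/2020"]
tags   = ["O",    "O",  "B-STAFF", "I-STAFF", "L-STAFF", "O", "U-DATE"]
print(bilou_to_spans(tokens, tags))
# [('STAFF', 'Dr. Jane Doe'), ('DATE', '10/12/2020')]
```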
### Dataset
The I2B2 2014 [Stubbs and Uzuner, 2015] dataset was used to train this model.
| Property | Details |
|----------|---------|
| Model Type | ClinicalBERT fine-tuned for medical note de-identification |
| Training Data | I2B2 2014 dataset |
PHI label distribution in the I2B2 2014 dataset (train set: 790 notes; test set: 514 notes):

| PHI LABEL | TRAIN COUNT | TRAIN PERCENTAGE | TEST COUNT | TEST PERCENTAGE |
|-----------|------------:|-----------------:|-----------:|----------------:|
| DATE | 7502 | 43.69 | 4980 | 44.14 |
| STAFF | 3149 | 18.34 | 2004 | 17.76 |
| HOSP | 1437 | 8.37 | 875 | 7.76 |
| AGE | 1233 | 7.18 | 764 | 6.77 |
| LOC | 1206 | 7.02 | 856 | 7.59 |
| PATIENT | 1316 | 7.66 | 879 | 7.79 |
| PHONE | 317 | 1.85 | 217 | 1.92 |
| ID | 881 | 5.13 | 625 | 5.54 |
| PATORG | 124 | 0.72 | 82 | 0.73 |
| EMAIL | 4 | 0.02 | 1 | 0.01 |
| OTHERPHI | 2 | 0.01 | 0 | 0 |
| TOTAL | 17171 | 100 | 11283 | 100 |
### Training procedure
Steps on how this model was trained can be found here: [Training](https://github.com/obi-ml-public/ehr_deidentification/tree/master/steps/train). The `model_name_or_path` was set to `emilyalsentzer/Bio_ClinicalBERT`.
- The dataset was sentencized with the en_core_sci_sm sentencizer from spaCy.
- The dataset was then tokenized with a custom tokenizer built on top of the en_core_sci_sm tokenizer from spaCy.
- For each sentence, 32 tokens were added on the left (from previous sentences) and 32 tokens on the right (from the following sentences).
- The added tokens are not used for learning; i.e., the loss is not computed on them. They serve only as additional context (see the sketch after this list).
- Each sequence contained a maximum of 128 tokens (including the added context tokens). Longer sequences were split.
- The sentencized and tokenized dataset, with token-level labels in BILOU notation, was used to train the model.
- The model was fine-tuned from the pre-trained ClinicalBERT checkpoint (emilyalsentzer/Bio_ClinicalBERT).
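As a rough illustration of this context windowing (an assumption about data layout, not the repo's actual preprocessing code), the sketch below pads a sentence with up to 32 surrounding tokens and marks which positions should contribute to the loss; splitting sequences longer than 128 tokens is omitted for brevity.

```python
# Hedged sketch of the context-window construction described above. The flat
# sentence/token representation and the helper name are assumptions.
def build_example(sentences: list[list[str]], idx: int, context: int = 32):
    """Return (tokens, loss_mask) for sentence `idx` padded with left/right context."""
    left = [tok for sent in sentences[:idx] for tok in sent][-context:]
    right = [tok for sent in sentences[idx + 1:] for tok in sent][:context]
    focus = sentences[idx]
    tokens = left + focus + right
    # Only the focus sentence's tokens are labeled; context tokens are excluded
    # from the loss (e.g. by assigning them the ignore label id -100).
    loss_mask = [False] * len(left) + [True] * len(focus) + [False] * len(right)
    return tokens, loss_mask
```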
Training details (a sketch mapping these onto a Hugging Face Trainer setup follows):
- Input sequence length: 128
- Batch size: 32
- Optimizer: AdamW
- Learning rate: 4e-5
- Dropout: 0.1
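The following is a hedged sketch of how these hyperparameters could be expressed with Hugging Face `TrainingArguments`; it is not the repo's actual training configuration (the Training link above is authoritative), and the output directory and epoch count are placeholders.

```python
# Hedged sketch only; see the repo's training configs for the real setup.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="clinicalbert-deid",   # placeholder path
    per_device_train_batch_size=32,   # batch size: 32
    learning_rate=4e-5,               # learning rate: 4e-5; the Trainer's default optimizer is AdamW
    num_train_epochs=3,               # placeholder; the epoch count is not stated in this card
)
# The input sequence length (128) is enforced at tokenization time, and dropout (0.1)
# is part of the model configuration (e.g. hidden_dropout_prob), not TrainingArguments.
```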
### Results
Quantitative evaluation results are not reported in this model card; see the GitHub repo [Robust DeID](https://github.com/obi-ml-public/ehr_deidentification) for more information.
## Technical Details
The model is fine-tuned from the pre-trained ClinicalBERT checkpoint (emilyalsentzer/Bio_ClinicalBERT). It uses a custom tokenizer and sentencizer built on top of spaCy's en_core_sci_sm model. The BILOU tagging scheme is used for token classification and span aggregation, and the loss is not computed on the added context tokens.
## License
This project is licensed under the MIT license.