# ClinicalBERT for Medical Note De-identification
This is a fine-tuned ClinicalBERT model designed for the de-identification of medical notes, aiming to protect patients' private health information.
## Features
- Fine-tuned ClinicalBERT: Based on the ClinicalBERT model [Alsentzer et al., 2019], fine-tuned for the de-identification of medical notes.
- Sequence Labeling: Trained to predict protected health information (PHI/PII) entities (spans). It classifies each token as non-PHI or as one of the 11 PHI types and aggregates token predictions into spans using BILOU tagging.
- HIPAA Compliance: The protected health information categories follow the regulations of [HIPAA](https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html).
- Detailed Documentation: All details about training, the PHI labels, and usage can be found in the GitHub repo [Robust DeID](https://github.com/obi-ml-public/ehr_deidentification).
## Installation
Specific installation steps are not provided in this model card; see the GitHub repo [Robust DeID](https://github.com/obi-ml-public/ehr_deidentification) for setup and usage details.
## Usage Examples
### Basic Usage
A demo of how the model works (using model predictions to de-identify a medical note) is available on this Space: [Medical-Note-Deidentification](https://huggingface.co/spaces/obi/Medical-Note-Deidentification).
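For programmatic use, the sketch below loads the model with the Hugging Face transformers token-classification pipeline and prints token-level PHI predictions. This is an illustrative assumption, not the project's official interface: the model identifier shown is a placeholder for this model's actual Hub path, and the repo's forward-pass scripts should be preferred for proper sentence handling and span aggregation.

```python
# Hedged sketch: token-level PHI predictions via the transformers pipeline.
# "obi/deid_bert_i2b2" is an assumed identifier -- replace it with this model's
# actual Hub path.
from transformers import pipeline

deid_tagger = pipeline(
    "token-classification",
    model="obi/deid_bert_i2b2",  # assumption: substitute the real model path
)

note = "Mr. Smith was seen at Memorial Hospital on 10/12/2020 by Dr. Jones."
for pred in deid_tagger(note):
    # Each prediction carries a BILOU-prefixed PHI label, a confidence score,
    # and character offsets into the input text.
    print(pred["entity"], pred["word"], pred["start"], pred["end"], round(pred["score"], 3))
```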
### Advanced Usage
Steps on how this model can be used to run a forward pass can be found here: [Forward Pass](https://github.com/obi-ml-public/ehr_deidentification/tree/master/steps/forward_pass). In brief:
- Sentencize and tokenize the dataset (the model aggregates the sentences back to the note level).
- Use the predict function of this model to gather the predictions (i.e., a prediction for each token).
- Additionally, the model predictions can be used to remove PHI from the original note/text, as sketched below.
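As an illustration of that last step, the sketch below shows one way character-offset span predictions could be used to redact a note. The span format and the `redact` helper are assumptions for illustration only; the Robust DeID repo defines its own data formats and utilities.

```python
# Hedged sketch: redacting a note given predicted PHI spans with character offsets.
def redact(text: str, spans: list[dict]) -> str:
    """Replace each predicted PHI span with a <<LABEL>> placeholder."""
    redacted = text
    # Process spans right-to-left so earlier character offsets stay valid.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"<<{span['label']}>>"
        redacted = redacted[: span["start"]] + placeholder + redacted[span["end"] :]
    return redacted

note = "Mr. Smith was seen at Memorial Hospital on 10/12/2020."
predicted_spans = [
    {"start": 4, "end": 9, "label": "PATIENT"},
    {"start": 22, "end": 39, "label": "HOSP"},
    {"start": 43, "end": 53, "label": "DATE"},
]
print(redact(note, predicted_spans))
# Mr. <<PATIENT>> was seen at <<HOSP>> on <<DATE>>.
```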
## Documentation
### Model Description
- A ClinicalBERT [Alsentzer et al., 2019] model fine-tuned for de-identification of medical notes.
- Sequence labeling (token classification): the model was trained to predict protected health information (PHI/PII) entities (spans). The list of protected health information categories is given by [HIPAA](https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html).
- A token can be classified either as non-PHI or as one of the 11 PHI types. Token predictions are aggregated into spans using BILOU tagging (a sketch of this aggregation follows this list).
- The PHI labels used for training and other details can be found here: [Annotation Guidelines](https://github.com/obi-ml-public/ehr_deidentification/blob/master/AnnotationGuidelines.md)
- More details on how to use this model, the expected data format, and other useful information are available in the GitHub repo: [Robust DeID](https://github.com/obi-ml-public/ehr_deidentification).
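To make the BILOU aggregation step concrete, here is a minimal illustrative re-implementation (not the repo's own aggregation code); the tag names and the (token, tag) input format are assumptions.

```python
# Hedged sketch: collecting BILOU token tags (B-, I-, L-, U-, O) into entity spans.
def bilou_to_spans(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Return (label, text) spans recovered from BILOU-tagged tokens."""
    spans, current_tokens, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("U-"):                       # single-token entity
            spans.append((tag[2:], token))
        elif tag.startswith("B-"):                     # beginning of an entity
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label:   # inside an entity
            current_tokens.append(token)
        elif tag.startswith("L-") and current_label:   # last token of an entity
            current_tokens.append(token)
            spans.append((current_label, " ".join(current_tokens)))
            current_tokens, current_label = [], None
        else:                                          # "O" / non-PHI token
            current_tokens, current_label = [], None
    return spans

tokens = ["Seen", "by", "Dr.", "Jane", "Doe", "on", "10/12/2020"]
tags   = ["O",    "O",  "B-STAFF", "I-STAFF", "L-STAFF", "O", "U-DATE"]
print(bilou_to_spans(tokens, tags))
# [('STAFF', 'Dr. Jane Doe'), ('DATE', '10/12/2020')]
```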
### Dataset
The I2B2 2014 [Stubbs and Uzuner, 2015] dataset was used to train this model.
| Property | Details |
|----------|---------|
| Model Type | ClinicalBERT fine-tuned for medical note de-identification |
| Training Data | I2B2 2014 dataset |
PHI label distribution in the I2B2 2014 dataset (train set: 790 notes; test set: 514 notes):

| PHI LABEL | TRAIN COUNT | TRAIN PERCENTAGE | TEST COUNT | TEST PERCENTAGE |
|-----------|------------:|-----------------:|-----------:|----------------:|
| DATE | 7502 | 43.69 | 4980 | 44.14 |
| STAFF | 3149 | 18.34 | 2004 | 17.76 |
| HOSP | 1437 | 8.37 | 875 | 7.76 |
| AGE | 1233 | 7.18 | 764 | 6.77 |
| LOC | 1206 | 7.02 | 856 | 7.59 |
| PATIENT | 1316 | 7.66 | 879 | 7.79 |
| PHONE | 317 | 1.85 | 217 | 1.92 |
| ID | 881 | 5.13 | 625 | 5.54 |
| PATORG | 124 | 0.72 | 82 | 0.73 |
| EMAIL | 4 | 0.02 | 1 | 0.01 |
| OTHERPHI | 2 | 0.01 | 0 | 0 |
| TOTAL | 17171 | 100 | 11283 | 100 |
### Training procedure
Steps on how this model was trained can be found here: [Training](https://github.com/obi-ml-public/ehr_deidentification/tree/master/steps/train). The `model_name_or_path` was set to `emilyalsentzer/Bio_ClinicalBERT`.
- The dataset was sentencized with the en_core_sci_sm sentencizer from spaCy.
- The dataset was then tokenized with a custom tokenizer built on top of the en_core_sci_sm tokenizer from spaCy.
- For each sentence, 32 tokens were added on the left (from previous sentences) and 32 tokens on the right (from the following sentences).
- The added tokens are not used for learning; i.e., the loss is not computed on them. They serve only as additional context (see the sketch after this list).
- Each sequence contained a maximum of 128 tokens (including the added context tokens). Longer sequences were split.
- The sentencized and tokenized dataset, with token-level labels in BILOU notation, was used to train the model.
- The model was fine-tuned from the pre-trained ClinicalBERT checkpoint (emilyalsentzer/Bio_ClinicalBERT).
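As a rough illustration of this context windowing (an assumption about data layout, not the repo's actual preprocessing code), the sketch below pads a sentence with up to 32 surrounding tokens and marks which positions should contribute to the loss; splitting sequences longer than 128 tokens is omitted for brevity.

```python
# Hedged sketch of the context-window construction described above. The flat
# sentence/token representation and the helper name are assumptions.
def build_example(sentences: list[list[str]], idx: int, context: int = 32):
    """Return (tokens, loss_mask) for sentence `idx` padded with left/right context."""
    left = [tok for sent in sentences[:idx] for tok in sent][-context:]
    right = [tok for sent in sentences[idx + 1:] for tok in sent][:context]
    focus = sentences[idx]
    tokens = left + focus + right
    # Only the focus sentence's tokens are labeled; context tokens are excluded
    # from the loss (e.g. by assigning them the ignore label id -100).
    loss_mask = [False] * len(left) + [True] * len(focus) + [False] * len(right)
    return tokens, loss_mask
```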
Training details (a sketch mapping these onto a Hugging Face Trainer setup follows):
- Input sequence length: 128
- Batch size: 32
- Optimizer: AdamW
- Learning rate: 4e-5
- Dropout: 0.1
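The following is a hedged sketch of how these hyperparameters could be expressed with Hugging Face `TrainingArguments`; it is not the repo's actual training configuration (the Training link above is authoritative), and the output directory and epoch count are placeholders.

```python
# Hedged sketch only; see the repo's training configs for the real setup.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="clinicalbert-deid",   # placeholder path
    per_device_train_batch_size=32,   # batch size: 32
    learning_rate=4e-5,               # learning rate: 4e-5; the Trainer's default optimizer is AdamW
    num_train_epochs=3,               # placeholder; the epoch count is not stated in this card
)
# The input sequence length (128) is enforced at tokenization time, and dropout (0.1)
# is part of the model configuration (e.g. hidden_dropout_prob), not TrainingArguments.
```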
### Results
Quantitative evaluation results are not reported in this model card; see the GitHub repo [Robust DeID](https://github.com/obi-ml-public/ehr_deidentification) for more information.
## Technical Details
The model is fine-tuned from the pre-trained ClinicalBERT checkpoint (emilyalsentzer/Bio_ClinicalBERT). It uses a custom tokenizer and sentencizer built on top of spaCy's en_core_sci_sm model. The BILOU tagging scheme is used for token classification and span aggregation, and the loss is not computed on the added context tokens.
## License
This project is licensed under the MIT license.