RoBERTa-based Medical Note De-identification Model
- This is a RoBERTa model fine-tuned for de-identifying medical notes, aiming to protect patients' private health information.
- It uses sequence labeling to predict protected health information entities, classifying tokens into non-PHI or one of 11 PHI types.
Quick Start
Features
- Fine-tuned RoBERTa: Based on the RoBERTa model [Liu et al., 2019], fine-tuned for medical note de-identification.
- Sequence Labeling: Trained to predict protected health information (PHI/PII) entities (spans) through token classification.
- 11 PHI Types: Capable of classifying tokens into 11 different PHI types, with token predictions aggregated to spans using BILOU tagging.
- Annotation Guidelines: Detailed PHI labels used for training and other information can be found in the Annotation Guidelines.
Installation
Installation instructions are not included in this card; see the Robust DeID GitHub repo for environment setup and dependencies.
Usage Examples
Basic Usage
Use the model to tag PHI tokens in a medical note; the tagged spans can then be removed or replaced. A practical demonstration is available in the Medical-Note-Deidentification demo.
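A minimal loading sketch with the Hugging Face transformers library is shown below. The repository id used here, obi/deid_roberta_i2b2, is an assumption not stated in this card; substitute the actual id of this model if it differs.

```python
# Minimal sketch, assuming the model is hosted on the Hugging Face Hub under
# the id "obi/deid_roberta_i2b2" (an assumption; substitute the actual id).
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_id = "obi/deid_roberta_i2b2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# aggregation_strategy="simple" merges per-token BILOU tags into entity spans.
deid = pipeline("token-classification", model=model, tokenizer=tokenizer,
                aggregation_strategy="simple")

note = "Seen by Dr. Jane Doe at City Hospital on 01/02/2014."
print(deid(note))  # list of PHI spans with type, text, score, and character offsets
```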
Advanced Usage
For more details on how to use this model, the data format, and other useful information, please refer to the GitHub repo: Robust DeID.
Documentation
Model Description
A RoBERTa model fine-tuned for de-identification of medical notes. It uses sequence labeling (token classification) to predict protected health information (PHI/PII) entities (spans). A token can be classified as non-PHI or as one of the 11 PHI types, and token predictions are aggregated to spans using BILOU tagging. The PHI labels used for training and other details can be found in the Annotation Guidelines.
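As a toy illustration of the BILOU aggregation step (not the repo's own code), per-token tags can be merged into PHI spans as follows; the STAFF and DATE labels match the PHI types listed in the Dataset section.

```python
# Toy illustration of BILOU aggregation: B-/I-/L- tags of the same type merge
# into one span, U- tags form single-token spans, and O marks non-PHI tokens.
tokens = ["Seen", "by", "Dr.", "Jane", "Doe", "on", "01/02/2014", "."]
tags   = ["O", "O", "B-STAFF", "I-STAFF", "L-STAFF", "O", "U-DATE", "O"]

spans, current = [], None
for tok, tag in zip(tokens, tags):
    prefix, _, label = tag.partition("-")
    if prefix == "B":                                  # beginning of a multi-token span
        current = ([tok], label)
    elif prefix in ("I", "L") and current and current[1] == label:
        current[0].append(tok)                         # inside / last token of the span
        if prefix == "L":
            spans.append((" ".join(current[0]), label))
            current = None
    elif prefix == "U":                                # unit-length (single-token) span
        spans.append((tok, label))
        current = None
    else:                                              # O or an inconsistent tag
        current = None

print(spans)  # [('Dr. Jane Doe', 'STAFF'), ('01/02/2014', 'DATE')]
```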
How to Use
- Demo: A demo of how the model works can be found on Medical-Note-Deidentification.
- Forward Pass: Steps for performing a forward pass with this model are available in the Forward Pass section of the GitHub repo. In brief, the steps are:
- Sentencize and tokenize the dataset.
- Use the model's predict function to gather predictions.
- Use the model predictions to remove PHI from the original note/text.
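A hedged sketch of the last step, removing predicted PHI from the original text, is shown below. It assumes span dictionaries with character offsets ("start", "end") and an "entity_group" type, the format returned by a transformers token-classification pipeline with an aggregation strategy; the Robust DeID repo's own post-processing may differ.

```python
def redact(text, spans):
    """Replace each predicted PHI span with a placeholder tag.

    Hypothetical helper for illustration; `spans` is a list of dicts with
    "start", "end", and "entity_group" keys.
    """
    # Apply replacements right-to-left so earlier character offsets stay valid.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        tag = "<<{}>>".format(span["entity_group"])
        text = text[:span["start"]] + tag + text[span["end"]:]
    return text

# Example: redact(note, deid(note)) -> "Seen by <<STAFF>> at ... on <<DATE>>."
```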
Dataset
The I2B2 2014 [Stubbs and Uzuner, 2015] dataset was used to train this model.

| Property | Details |
|----------|---------|
| Model Type | RoBERTa model fine-tuned for de-identification of medical notes |
| Training Data | I2B2 2014 dataset |

The following table shows the distribution of PHI labels in the training and test sets:

| PHI Label | Training Set Count (790 Notes) | Training Set % | Test Set Count (514 Notes) | Test Set % |
|-----------|-------------------------------:|---------------:|---------------------------:|-----------:|
| DATE | 7502 | 43.69 | 4980 | 44.14 |
| STAFF | 3149 | 18.34 | 2004 | 17.76 |
| HOSP | 1437 | 8.37 | 875 | 7.76 |
| AGE | 1233 | 7.18 | 764 | 6.77 |
| LOC | 1206 | 7.02 | 856 | 7.59 |
| PATIENT | 1316 | 7.66 | 879 | 7.79 |
| PHONE | 317 | 1.85 | 217 | 1.92 |
| ID | 881 | 5.13 | 625 | 5.54 |
| PATORG | 124 | 0.72 | 82 | 0.73 |
| EMAIL | 4 | 0.02 | 1 | 0.01 |
| OTHERPHI | 2 | 0.01 | 0 | 0 |
| TOTAL | 17171 | 100 | 11283 | 100 |
Training Procedure
Steps for training this model can be found in the Training section of the GitHub repo. The `model_name_or_path` argument was set to `roberta-large`. The training details are as follows (an illustrative configuration sketch follows the list):
- Input Sequence Length: 128
- Batch Size: 32 (16 with 2 gradient accumulation steps)
- Optimizer: AdamW
- Learning Rate: 5e-5
- Dropout: 0.1
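For orientation, the hyperparameters above roughly map onto a standard transformers `TrainingArguments` configuration, sketched below. This is illustrative only; the Robust DeID repo drives training through its own scripts and configuration files.

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters (a sketch, not the
# repo's actual training config). Dropout (0.1) and the 128-token input
# length are set on the model/tokenizer rather than here.
training_args = TrainingArguments(
    output_dir="./deid_roberta_i2b2",   # hypothetical output path
    per_device_train_batch_size=16,     # 16 x 2 accumulation steps = batch size 32
    gradient_accumulation_steps=2,
    learning_rate=5e-5,                 # AdamW is the transformers default optimizer
)
```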
Technical Details
- The dataset was sentencized with the en_core_sci_sm sentencizer and then tokenized with a custom tokenizer built on top of the en_core_sci_sm tokenizer (a scispaCy model, loaded via spaCy).
- For each sentence, 32 tokens were added on the left (from previous sentences) and 32 tokens on the right (from the next sentences) as additional context; these context tokens were not used for learning (a sketch of this packing scheme follows this list).
- Each sequence contained a maximum of 128 tokens (including the added tokens), and longer sequences were split.
- The sentencized and tokenized dataset with token level labels based on the BILOU notation was used to train the model.
- The model is fine-tuned from a pre-trained RoBERTa model.
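A minimal sketch of the context-window packing described above, assuming a hypothetical helper and the usual convention of masking context labels with -100 so they are ignored by the loss:

```python
IGNORE_LABEL = -100   # conventional ignore index; context tokens do not contribute to the loss
MAX_LEN = 128
CONTEXT = 32          # context tokens added on each side

def build_example(prev_tokens, sent_tokens, next_tokens, sent_labels):
    """Hypothetical helper: pack one sentence plus surrounding context.

    Up to 32 tokens from the previous/next sentences are prepended/appended,
    with their labels masked so only the centre sentence is learned from.
    Longer sequences would be split upstream, as described above.
    """
    left, right = prev_tokens[-CONTEXT:], next_tokens[:CONTEXT]
    tokens = left + sent_tokens + right
    labels = [IGNORE_LABEL] * len(left) + sent_labels + [IGNORE_LABEL] * len(right)
    return tokens[:MAX_LEN], labels[:MAX_LEN]
```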
License
This project is licensed under the MIT license.