RoBERTa-based Medical Note De-identification Model
- This is a RoBERTa model fine-tuned for de-identifying medical notes, aiming to protect patients' private health information.
- It uses sequence labeling to predict protected health information entities, classifying tokens into non-PHI or one of 11 PHI types.
Quick Start
Features
- Fine-tuned RoBERTa: Based on the RoBERTa model [Liu et al., 2019], fine-tuned for medical note de-identification.
- Sequence Labeling: Trained to predict protected health information (PHI/PII) entities (spans) through token classification.
- 11 PHI Types: Capable of classifying tokens into 11 different PHI types, with token predictions aggregated to spans using BILOU tagging.
- Annotation Guidelines: Detailed PHI labels used for training and other information can be found in the Annotation Guidelines.
Installation
Installation instructions are not included in this card; see the Robust DeID GitHub repo for environment setup and dependencies.
Usage Examples
Basic Usage
Use the model to tag PHI tokens in a medical note; the tagged spans can then be removed or replaced. A practical demonstration is available in the Medical-Note-Deidentification demo.
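A minimal loading sketch with the Hugging Face transformers library is shown below. The repository id used here, obi/deid_roberta_i2b2, is an assumption not stated in this card; substitute the actual id of this model if it differs.

```python
# Minimal sketch, assuming the model is hosted on the Hugging Face Hub under
# the id "obi/deid_roberta_i2b2" (an assumption; substitute the actual id).
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_id = "obi/deid_roberta_i2b2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# aggregation_strategy="simple" merges per-token BILOU tags into entity spans.
deid = pipeline("token-classification", model=model, tokenizer=tokenizer,
                aggregation_strategy="simple")

note = "Seen by Dr. Jane Doe at City Hospital on 01/02/2014."
print(deid(note))  # list of PHI spans with type, text, score, and character offsets
```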
Advanced Usage
For more details on how to use this model, the data format, and other useful information, please refer to the GitHub repo: Robust DeID.
Documentation
Model Description
A RoBERTa model fine-tuned for de-identification of medical notes. It uses sequence labeling (token classification) to predict protected health information (PHI/PII) entities (spans). A token can be classified as non-PHI or as one of the 11 PHI types, and token predictions are aggregated to spans using BILOU tagging. The PHI labels used for training and other details can be found in the Annotation Guidelines.
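As a toy illustration of the BILOU aggregation step (not the repo's own code), per-token tags can be merged into PHI spans as follows; the STAFF and DATE labels match the PHI types listed in the Dataset section.

```python
# Toy illustration of BILOU aggregation: B-/I-/L- tags of the same type merge
# into one span, U- tags form single-token spans, and O marks non-PHI tokens.
tokens = ["Seen", "by", "Dr.", "Jane", "Doe", "on", "01/02/2014", "."]
tags   = ["O", "O", "B-STAFF", "I-STAFF", "L-STAFF", "O", "U-DATE", "O"]

spans, current = [], None
for tok, tag in zip(tokens, tags):
    prefix, _, label = tag.partition("-")
    if prefix == "B":                                  # beginning of a multi-token span
        current = ([tok], label)
    elif prefix in ("I", "L") and current and current[1] == label:
        current[0].append(tok)                         # inside / last token of the span
        if prefix == "L":
            spans.append((" ".join(current[0]), label))
            current = None
    elif prefix == "U":                                # unit-length (single-token) span
        spans.append((tok, label))
        current = None
    else:                                              # O or an inconsistent tag
        current = None

print(spans)  # [('Dr. Jane Doe', 'STAFF'), ('01/02/2014', 'DATE')]
```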
How to Use
- Demo: A demo of how the model works can be found on Medical-Note-Deidentification.
- Forward Pass: Steps for performing a forward pass with this model are available in the Forward Pass section of the GitHub repo. In brief, the steps are:
- Sentencize and tokenize the dataset.
- Use the model's predict function to gather predictions.
- Use the model predictions to remove PHI from the original note/text.
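A hedged sketch of the last step, removing predicted PHI from the original text, is shown below. It assumes span dictionaries with character offsets ("start", "end") and an "entity_group" type, the format returned by a transformers token-classification pipeline with an aggregation strategy; the Robust DeID repo's own post-processing may differ.

```python
def redact(text, spans):
    """Replace each predicted PHI span with a placeholder tag.

    Hypothetical helper for illustration; `spans` is a list of dicts with
    "start", "end", and "entity_group" keys.
    """
    # Apply replacements right-to-left so earlier character offsets stay valid.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        tag = "<<{}>>".format(span["entity_group"])
        text = text[:span["start"]] + tag + text[span["end"]:]
    return text

# Example: redact(note, deid(note)) -> "Seen by <<STAFF>> at ... on <<DATE>>."
```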
Dataset
The I2B2 2014 [Stubbs and Uzuner, 2015] dataset was used to train this model.

| Property | Details |
|----------|---------|
| Model Type | RoBERTa model fine-tuned for de-identification of medical notes |
| Training Data | I2B2 2014 dataset |

The following table shows the distribution of PHI labels in the training and test sets:

| PHI Label | Training Set Count (790 Notes) | Training Set % | Test Set Count (514 Notes) | Test Set % |
|-----------|-------------------------------:|---------------:|---------------------------:|-----------:|
| DATE | 7502 | 43.69 | 4980 | 44.14 |
| STAFF | 3149 | 18.34 | 2004 | 17.76 |
| HOSP | 1437 | 8.37 | 875 | 7.76 |
| AGE | 1233 | 7.18 | 764 | 6.77 |
| LOC | 1206 | 7.02 | 856 | 7.59 |
| PATIENT | 1316 | 7.66 | 879 | 7.79 |
| PHONE | 317 | 1.85 | 217 | 1.92 |
| ID | 881 | 5.13 | 625 | 5.54 |
| PATORG | 124 | 0.72 | 82 | 0.73 |
| EMAIL | 4 | 0.02 | 1 | 0.01 |
| OTHERPHI | 2 | 0.01 | 0 | 0 |
| TOTAL | 17171 | 100 | 11283 | 100 |
Training Procedure
Steps for training this model can be found in the Training section of the GitHub repo. The `model_name_or_path` argument was set to `roberta-large`. The training details are as follows (an illustrative configuration sketch follows the list):
- Input Sequence Length: 128
- Batch Size: 32 (16 with 2 gradient accumulation steps)
- Optimizer: AdamW
- Learning Rate: 5e-5
- Dropout: 0.1
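For orientation, the hyperparameters above roughly map onto a standard transformers `TrainingArguments` configuration, sketched below. This is illustrative only; the Robust DeID repo drives training through its own scripts and configuration files.

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters (a sketch, not the
# repo's actual training config). Dropout (0.1) and the 128-token input
# length are set on the model/tokenizer rather than here.
training_args = TrainingArguments(
    output_dir="./deid_roberta_i2b2",   # hypothetical output path
    per_device_train_batch_size=16,     # 16 x 2 accumulation steps = batch size 32
    gradient_accumulation_steps=2,
    learning_rate=5e-5,                 # AdamW is the transformers default optimizer
)
```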
Technical Details
- The dataset was sentencized with the en_core_sci_sm sentencizer and then tokenized with a custom tokenizer built on top of the en_core_sci_sm tokenizer (a scispaCy model, loaded via spaCy).
- For each sentence, 32 tokens were added on the left (from previous sentences) and 32 tokens on the right (from the next sentences) as additional context; these context tokens were not used for learning (a sketch of this packing scheme follows this list).
- Each sequence contained a maximum of 128 tokens (including the added tokens), and longer sequences were split.
- The sentencized and tokenized dataset with token level labels based on the BILOU notation was used to train the model.
- The model is fine-tuned from a pre-trained RoBERTa model.
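A minimal sketch of the context-window packing described above, assuming a hypothetical helper and the usual convention of masking context labels with -100 so they are ignored by the loss:

```python
IGNORE_LABEL = -100   # conventional ignore index; context tokens do not contribute to the loss
MAX_LEN = 128
CONTEXT = 32          # context tokens added on each side

def build_example(prev_tokens, sent_tokens, next_tokens, sent_labels):
    """Hypothetical helper: pack one sentence plus surrounding context.

    Up to 32 tokens from the previous/next sentences are prepended/appended,
    with their labels masked so only the centre sentence is learned from.
    Longer sequences would be split upstream, as described above.
    """
    left, right = prev_tokens[-CONTEXT:], next_tokens[:CONTEXT]
    tokens = left + sent_tokens + right
    labels = [IGNORE_LABEL] * len(left) + sent_labels + [IGNORE_LABEL] * len(right)
    return tokens[:MAX_LEN], labels[:MAX_LEN]
```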
License
This project is licensed under the MIT license.