Stanford-Deidentifier: An Open-Source Radiology Report De-identification System - Automatically Protecting the Privacy of Health Information

Stanford Deidentifier With Radiology Reports And I2b2

Developed by StanfordAIMI

Transformer-based automated de-identification system for radiology reports, achieving privacy protection by detecting Protected Health Information (PHI) and replacing it with realistic surrogate values

Sequence Labeling

Transformers

Language: English

Open Source License: MIT

Tags: #Radiology Report De-identification #PHI Automatic Detection #Biomedical Text Processing

Downloads 126

Release Time: 6/9/2022

Model Overview

An automated de-identification model designed for radiology and biomedical documents. It combines a PubMedBERT transformer with 'Hide in Plain Sight' rule-based methods to detect PHI and replace it with realistic surrogates.

Model Features

Cross-institutional High Performance

Achieves F1 scores of 97.9 and 99.6 on radiology test sets from a known and a new institution, respectively, and outperforms human labelers on i2b2 2014 data

Hybrid Methodology

Combines a PubMedBERT transformer for PHI detection with 'Hide in Plain Sight' rule-based methods for replacement, so that detected PHI is swapped for realistic surrogate values

Multi-domain Validation

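Developed on a multi-institutional, cross-domain dataset of 6,193 documents (chest X-ray and CT reports plus medical notes) and evaluated on two radiology test sets and the i2b2 2006 and 2014 test sets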

Model Capabilities

Protected Health Information Detection

Medical Text De-identification

Realistic Surrogate Value Generation

Radiology Report Privacy Processing

Use Cases

Medical Privacy Protection

Chest X-ray Report De-identification

Automatically identifies and replaces sensitive information (patient/doctor/institution) in chest X-ray reports

Recall for detecting the core of each PHI span reaches 99.1 on reports from a known institution

Cross-institutional Data Sharing

Enables de-identified sharing of medical data while preserving its clinical value

Achieves an F1 score of 99.6 on data from a new institution
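
As a rough illustration of how precision, recall, and F1 figures like the ones quoted above can be computed, the sketch below scores PHI detection at the token level, treating every non-"O" label as PHI. This is only one common convention; the exact evaluation protocol behind the reported numbers is not described on this page, and the label names (`NAME`, `DATE`) and the function `token_prf` are hypothetical.

```python
from typing import Sequence, Tuple


def token_prf(
    gold: Sequence[str], pred: Sequence[str], outside: str = "O"
) -> Tuple[float, float, float]:
    """Binary PHI-vs-non-PHI precision, recall and F1 over aligned token labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    # A token counts as PHI whenever its label differs from the outside label.
    tp = sum(1 for g, p in pairs if g != outside and p != outside)
    fp = sum(1 for g, p in pairs if g == outside and p != outside)
    fn = sum(1 for g, p in pairs if g != outside and p == outside)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical labels for six tokens: one false positive, one false negative.
gold = ["O", "DATE", "O", "NAME", "NAME", "O"]
pred = ["O", "DATE", "DATE", "NAME", "O", "O"]
precision, recall, f1 = token_prf(gold, pred)
```

For de-identification, recall is usually the figure to watch most closely, since a missed PHI token is a privacy leak while a false positive only removes some non-sensitive text.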

🚀 Stanford De-identifier

The Stanford de-identifier is trained on a variety of radiology and biomedical documents. Its goal is to automate the de-identification process with accuracy sufficient for production use.

✨ Features

Token Classification: Performs token-level classification.
Sequence Tagger Model: Uses a sequence tagger model to label each token.
PyTorch and Transformers: Built on PyTorch and the Transformers library.
PubMedBERT: Based on the uncased PubMedBERT model.
Radiology and Biomedical Focus: Specialized for radiology and biomedical documents.
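
Because the model labels individual tokens, downstream code typically has to group neighbouring tokens with the same label into a single PHI span. The sketch below shows one simple way to do that on character offsets. The tuple layout, the label names, the `max_gap` parameter, and the function `merge_token_labels` are assumptions made for illustration, not part of the released model.

```python
from typing import List, Tuple

# (start offset, end offset, predicted label); "O" marks non-PHI tokens.
Token = Tuple[int, int, str]


def merge_token_labels(
    tokens: List[Token], outside: str = "O", max_gap: int = 1
) -> List[Token]:
    """Merge neighbouring tokens that share a non-outside label into one span.

    Two tokens are merged when they carry the same label and are separated
    by at most `max_gap` characters (for example a single space).
    """
    spans: List[Token] = []
    for start, end, label in tokens:
        if label == outside:
            continue
        if spans and spans[-1][2] == label and start - spans[-1][1] <= max_gap:
            prev_start, _, _ = spans[-1]
            spans[-1] = (prev_start, end, label)  # extend the open span
        else:
            spans.append((start, end, label))  # start a new span
    return spans


text = "Dr. Curt Langlotz chose to schedule a meeting on 06/23."
first = text.index("Curt")
last = text.index("Langlotz")
# Hypothetical token-level predictions for the name in the sentence above.
tokens = [(0, 3, "O"), (first, first + 4, "NAME"), (last, last + 8, "NAME")]
spans = merge_token_labels(tokens)
name_start, name_end, _ = spans[0]
merged_name = text[name_start:name_end]  # "Curt Langlotz"
```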

📦 Installation

No installation steps are provided in the original document.

💻 Usage Examples

No code examples are provided in the original document.
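
As an illustration that is not part of the original model card, the sketch below shows how a token-classification model of this kind is commonly loaded with the Hugging Face `transformers` pipeline and how detected spans can be masked. It assumes `transformers` and `torch` are installed and that the model is published under the id in `MODEL_ID`, which is an assumption inferred from the model name and should be checked against the model page. The helper names `detect_phi` and `mask_entities` are hypothetical.

```python
from typing import Dict, List

# Assumed Hugging Face model id; verify it on the model page before use.
MODEL_ID = "StanfordAIMI/stanford-deidentifier-with-radiology-reports-and-i2b2"


def detect_phi(text: str, model_id: str = MODEL_ID) -> List[Dict]:
    """Run a token-classification pipeline and return grouped entity spans."""
    # Imported lazily so mask_entities works without transformers installed.
    from transformers import pipeline

    tagger = pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # group word pieces into entity spans
    )
    return tagger(text)


def mask_entities(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a bracketed label.

    Spans are processed from right to left so that earlier character
    offsets stay valid while the text is being edited.
    """
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"] :]
    return text


# Typical use (downloads the model on first call):
#   report = "Dr. Curt Langlotz chose to schedule a meeting on 06/23."
#   print(mask_entities(report, detect_phi(report)))
```

Masking with bracketed labels is the simplest option. The published pipeline instead replaces PHI with realistic surrogates, which this sketch does not attempt.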

📚 Documentation

Sample Report

widget:
- text: "PROCEDURE: Chest xray. COMPARISON: last seen on 1/1/2020 and also record dated of March 1st, 2019. FINDINGS: patchy airspace opacities. IMPRESSION: The results of the chest xray of January 1 2020 are the most concerning ones. The patient was transmitted to another service of UH Medical Center under the responsability of Dr. Perez. We used the system MedClinical data transmitter and sent the data on 2/1/2020, under the ID 5874233. We received the confirmation of Dr Perez. He is reachable at 567-493-1234."
- text: "Dr. Curt Langlotz chose to schedule a meeting on 06/23."

Associated Dataset

Dataset: radreports

Associated Repo

GitHub Repo: https://github.com/MIDRC/Stanford_Penn_Deidentifier

🔧 Technical Details

The Stanford de-identifier is designed to automate the de-identification of radiology and biomedical documents. It was trained on a large multi-institutional and cross-domain dataset, which includes various radiology reports and medical notes. By combining transformer and “hide in plain sight” rule-based methods, it can detect protected health information (PHI) entities and replace them with realistic surrogates.
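
The replacement step can be pictured with a small sketch. The idea behind "hide in plain sight" is that detected PHI is replaced with plausible surrogates rather than visible placeholders, so any PHI the detector missed is harder to tell apart from the surrogates. The code below is a toy version under stated assumptions: the surrogate pool, the label names (`NAME`, `ID`, `PHONE`), and the functions `make_surrogate` and `hide_in_plain_sight` are all hypothetical, and the actual rules used by the authors are not described on this page.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical surrogate pool; a real system would use much larger lists.
SURROGATE_NAMES = ["Alex Morgan", "Jordan Lee", "Sam Rivera", "Taylor Kim"]


def make_surrogate(value: str, label: str, rng: random.Random) -> str:
    """Produce a replacement value for one PHI span."""
    if label == "NAME":
        return rng.choice(SURROGATE_NAMES)
    if label in ("ID", "PHONE"):
        # Keep the format (length, separators) but randomize every digit.
        return "".join(
            str(rng.randrange(10)) if ch.isdigit() else ch for ch in value
        )
    # Labels without a rule fall back to a visible placeholder.
    return f"[{label}]"


def hide_in_plain_sight(text: str, entities: List[Dict], seed: int = 0) -> str:
    """Replace each span, reusing one surrogate per distinct original value."""
    rng = random.Random(seed)
    cache: Dict[Tuple[str, str], str] = {}
    pieces: List[str] = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        original = text[ent["start"] : ent["end"]]
        key = (ent["entity_group"], original.lower())
        if key not in cache:
            cache[key] = make_surrogate(original, ent["entity_group"], rng)
        pieces.append(text[cursor : ent["start"]])  # untouched text
        pieces.append(cache[key])  # surrogate for this span
        cursor = ent["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Caching surrogates per original value keeps a report internally consistent: a clinician mentioned twice receives the same replacement name both times. Dates are deliberately left to the placeholder fallback here, since realistic date surrogates need format-aware parsing and consistent shifting.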

📄 License

License: MIT

📚 Citation

@article{10.1093/jamia/ocac219,
    author = {Chambon, Pierre J and Wu, Christopher and Steinkamp, Jackson M and Adleberg, Jason and Cook, Tessa S and Langlotz, Curtis P},
    title = "{Automated deidentification of radiology reports combining transformer and “hide in plain sight” rule-based methods}",
    journal = {Journal of the American Medical Informatics Association},
    year = {2022},
    month = {11},
    abstract = "{To develop an automated deidentification pipeline for radiology reports that detects protected health information (PHI) entities and replaces them with realistic surrogates “hiding in plain sight.” In this retrospective study, 999 chest X-ray and CT reports collected between November 2019 and November 2020 were annotated for PHI at the token level and combined with 3001 X-rays and 2193 medical notes previously labeled, forming a large multi-institutional and cross-domain dataset of 6193 documents. Two radiology test sets, from a known and a new institution, as well as i2b2 2006 and 2014 test sets, served as an evaluation set to estimate model performance and to compare it with previously released deidentification tools. Several PHI detection models were developed based on different training datasets, fine-tuning approaches and data augmentation techniques, and a synthetic PHI generation algorithm. These models were compared using metrics such as precision, recall and F1 score, as well as paired samples Wilcoxon tests. Our best PHI detection model achieves 97.9 F1 score on radiology reports from a known institution, 99.6 from a new institution, 99.5 on i2b2 2006, and 98.9 on i2b2 2014. On reports from a known institution, it achieves 99.1 recall of detecting the core of each PHI span. Our model outperforms all deidentifiers it was compared to on all test sets as well as human labelers on i2b2 2014 data. It enables accurate and automatic deidentification of radiology reports. A transformer-based deidentification pipeline can achieve state-of-the-art performance for deidentifying radiology reports and other medical documents.}",
    issn = {1527-974X},
    doi = {10.1093/jamia/ocac219},
    url = {https://doi.org/10.1093/jamia/ocac219},
    note = {ocac219},
    eprint = {https://academic.oup.com/jamia/advance-article-pdf/doi/10.1093/jamia/ocac219/47220191/ocac219.pdf},
}
