stanford-deidentifier: a free, open-source system for automated, high-accuracy deidentification of radiology reports

Stanford Deidentifier Only Radiology Reports Augmented

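Developed by StanfordAIMI

A transformer-based system for automated deidentification of radiology reports, combined with rule-based methods for high-accuracy PHI detection and replacement

Sequence labeling

Transformers

Language: English
License: MIT
Tags: #radiology-report-deidentification #automatic-PHI-detection #biomedical-NLP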

Downloads: 30

Released: 6/9/2022

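Model Overview

An automated deidentification model designed for radiology and biomedical documents. It detects protected health information (PHI) entities and replaces them with safe surrogate values, to help meet HIPAA privacy requirements.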

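Model Features

High performance across institutions

Achieves an F1 score of 97.9 on radiology reports from a known institution and 99.6 on reports from a new institution; on i2b2 2014 data it outperforms human labelers

Multi-domain adaptability

Trained on 6193 multi-institutional, cross-domain documents: 999 chest X-ray and CT reports, 3001 X-rays and 2193 medical notes

Hybrid design

Combines a PubMedBERT transformer model with "hide in plain sight" rule-based methods for accurate PHI detection and replacement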
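
The replacement half of the hybrid design can be illustrated with a minimal sketch. It assumes PHI spans have already been detected and swaps each one for a realistic surrogate of the same type, so that any PHI the detector missed is harder to tell apart from the surrogates. The label names, surrogate pools and the function `hide_in_plain_sight` are hypothetical and chosen for illustration; they are not taken from the project's code.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical surrogate pools; label names and values are illustrative only.
SURROGATES: Dict[str, List[str]] = {
    "HCW": ["Dr. Smith", "Dr. Chen", "Dr. Garcia"],
    "HOSPITAL": ["Lakeside Hospital", "Northgate Medical Center"],
    "DATE": ["3/14/2018", "11/2/2017"],
}

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def hide_in_plain_sight(text: str, spans: List[Span], seed: int = 0) -> str:
    """Replace each detected span with a surrogate of the same label.

    The same (label, original text) pair always maps to the same surrogate
    within one call, so repeated mentions stay consistent.
    """
    rng = random.Random(seed)  # seeded for reproducible output
    chosen: Dict[Tuple[str, str], str] = {}
    out: List[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip spans that overlap one already replaced
        key = (label, text[start:end])
        if key not in chosen:
            pool = SURROGATES.get(label)
            # Fall back to a bracketed tag when no pool exists for the label.
            chosen[key] = rng.choice(pool) if pool else f"[{label}]"
        out.append(text[cursor:start])
        out.append(chosen[key])
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

A production pipeline would generate surrogates rather than draw from small fixed lists, and would typically shift dates consistently instead of sampling them; the sketch only shows the span-by-span substitution.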

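Model Capabilities

PHI detection in radiology reports

Deidentification of biomedical text

Automatic replacement of sensitive information

Processing of documents from multiple institutions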

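Use Cases

Medical privacy protection

Chest X-ray report deidentification

Automatically detects and replaces patient information, physician names and institution information in chest X-ray reports

Achieves 99.1 recall for detecting the core of each PHI span on reports from a known institution

Cross-institution data sharing

Processes radiology reports from different medical institutions and produces standardized deidentified output

Achieves an F1 score of 99.6 on data from a new institution

Research data preparation

Clinical research data deidentification

Prepares radiology datasets that satisfy privacy requirements for medical research

Supports building research datasets aligned with HIPAA requirements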

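🚀 Stanford Deidentifier

The Stanford deidentifier was trained on a variety of radiology and biomedical documents, with the goal of automating the deidentification process while reaching an accuracy sufficient for production use. The accompanying paper appeared in the Journal of the American Medical Informatics Association (2022); see the citation information at the end of this page.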

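🚀 Quick Start

The Stanford deidentifier can be used to automate the deidentification of radiology and biomedical documents. Associated GitHub repository: https://github.com/MIDRC/Stanford_Penn_Deidentifier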
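
The GitHub repository remains the reference for the full pipeline. As a rough illustration of the detection step only, the sketch below wraps any token-classification tagger and masks the spans it returns. It assumes the model is published on the Hugging Face Hub under an id such as `StanfordAIMI/stanford-deidentifier-only-radiology-reports-augmented` and that it can be loaded with the generic Transformers `pipeline` API; the helper names (`load_tagger`, `find_phi_spans`, `mask_phi`) and the label names in the comments are hypothetical.

```python
from typing import Callable, Dict, List

# A tagger is any callable mapping text to a list of entity dicts with
# "start", "end" and "entity_group" keys, which is the shape returned by a
# Hugging Face token-classification pipeline with aggregation_strategy="simple".
Tagger = Callable[[str], List[Dict]]


def load_tagger(model_id: str) -> Tagger:
    """Build a tagger from a Hub model id (needs `transformers` and `torch`)."""
    from transformers import pipeline  # lazy import: the rest runs without it

    return pipeline(
        "token-classification", model=model_id, aggregation_strategy="simple"
    )


def find_phi_spans(text: str, tagger: Tagger) -> List[Dict]:
    """Return detected spans sorted by position, with the matched text attached."""
    spans = []
    for ent in tagger(text):
        start, end = int(ent["start"]), int(ent["end"])
        spans.append(
            {"start": start, "end": end, "label": ent["entity_group"],
             "text": text[start:end]}
        )
    return sorted(spans, key=lambda s: s["start"])


def mask_phi(text: str, spans: List[Dict]) -> str:
    """Replace each span with a bracketed label such as [DATE]."""
    out, cursor = [], 0
    for s in spans:
        out.append(text[cursor:s["start"]])
        out.append(f"[{s['label']}]")
        cursor = s["end"]
    out.append(text[cursor:])
    return "".join(out)


# Example use (downloads the model; the id below is an assumption):
# tagger = load_tagger("StanfordAIMI/stanford-deidentifier-only-radiology-reports-augmented")
# spans = find_phi_spans("Dr. Perez is reachable at 567-493-1234.", tagger)
# print(mask_phi("Dr. Perez is reachable at 567-493-1234.", spans))
```

Keeping the tagger as a plain callable makes it easy to swap in a stub for testing or a different model, and leaves surrogate replacement as a separate step.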

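✨ Key Features

Support for multiple document types: handles a variety of radiology and biomedical documents.
Automated deidentification: automatically detects and processes sensitive information in documents.
High accuracy: F1 scores of 97.9 (known institution), 99.6 (new institution), 99.5 (i2b2 2006) and 98.9 (i2b2 2014), as reported in the paper.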
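
Accuracy for this kind of model is reported as precision, recall and F1. The sketch below shows one common way to compute them at the token level from gold and predicted labels. It assumes a single label per token with `"O"` marking non-PHI tokens; the exact scoring protocol used in the paper may differ, and the function name `phi_prf` is hypothetical.

```python
from typing import Sequence, Tuple


def phi_prf(
    gold: Sequence[str], pred: Sequence[str], outside: str = "O"
) -> Tuple[float, float, float]:
    """Micro-averaged token-level precision, recall and F1 over PHI labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    # True positive: a PHI token predicted with the correct label.
    tp = sum(1 for g, p in pairs if g != outside and g == p)
    # False positive: a PHI label predicted where gold disagrees.
    fp = sum(1 for g, p in pairs if p != outside and g != p)
    # False negative: a gold PHI token that was missed or mislabeled.
    fn = sum(1 for g, p in pairs if g != outside and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1
```

For deidentification, recall is usually the figure to watch most closely, since a missed PHI token is a privacy leak while a false positive only removes some non-sensitive text.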

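📚 Documentation

Example text

PROCEDURE: Chest X-ray. COMPARISON: last seen on January 1, 2020, and also a record dated March 1, 2019. FINDINGS: patchy airspace opacities. IMPRESSION: The results of the chest X-ray of January 1, 2020 are the most concerning. The patient was transferred to another service of UH Medical Center under the responsibility of Dr. Perez. We used the MedClinical data transmission system and sent the data on February 1, 2020, under the ID 5874233. We received confirmation from Dr. Perez. He is reachable at 567-493-1234.

Dr. Curt Langlotz chose to schedule a meeting on June 23.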

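Model information

Property	Details
Model type	Token classification, sequence labeling model
Training data	radreports dataset
Framework	PyTorch, Transformers
Pretrained model	PubMedBERT (uncased)
Application domain	Radiology, biomedicine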

📄 License

This project is released under the MIT License.

📚 Citation

If you use this project, please cite the following paper:

@article{10.1093/jamia/ocac219,
    author = {Chambon, Pierre J and Wu, Christopher and Steinkamp, Jackson M and Adleberg, Jason and Cook, Tessa S and Langlotz, Curtis P},
    title = "{Automated deidentification of radiology reports combining transformer and “hide in plain sight” rule-based methods}",
    journal = {Journal of the American Medical Informatics Association},
    year = {2022},
    month = {11},
    abstract = "{To develop an automated deidentification pipeline for radiology reports that detect protected health information (PHI) entities and replaces them with realistic surrogates “hiding in plain sight.” In this retrospective study, 999 chest X-ray and CT reports collected between November 2019 and November 2020 were annotated for PHI at the token level and combined with 3001 X-rays and 2193 medical notes previously labeled, forming a large multi-institutional and cross-domain dataset of 6193 documents. Two radiology test sets, from a known and a new institution, as well as i2b2 2006 and 2014 test sets, served as an evaluation set to estimate model performance and to compare it with previously released deidentification tools. Several PHI detection models were developed based on different training datasets, fine-tuning approaches and data augmentation techniques, and a synthetic PHI generation algorithm. These models were compared using metrics such as precision, recall and F1 score, as well as paired samples Wilcoxon tests. Our best PHI detection model achieves 97.9 F1 score on radiology reports from a known institution, 99.6 from a new institution, 99.5 on i2b2 2006, and 98.9 on i2b2 2014. On reports from a known institution, it achieves 99.1 recall of detecting the core of each PHI span. Our model outperforms all deidentifiers it was compared to on all test sets as well as human labelers on i2b2 2014 data. It enables accurate and automatic deidentification of radiology reports. A transformer-based deidentification pipeline can achieve state-of-the-art performance for deidentifying radiology reports and other medical documents.}",
    issn = {1527-974X},
    doi = {10.1093/jamia/ocac219},
    url = {https://doi.org/10.1093/jamia/ocac219},
    note = {ocac219},
    eprint = {https://academic.oup.com/jamia/advance-article-pdf/doi/10.1093/jamia/ocac219/47220191/ocac219.pdf},
}