EDS-Pseudo-Public Open Source Model - For Medical Document Entity Detection and Clinical Report Anonymization

Eds Pseudo Public

Developed by AP-HP

EDS-Pseudo is a hybrid (rule-based + deep learning) model that detects identifying entities in medical documents. It is primarily used to anonymize clinical reports in the AP-HP clinical data warehouse.

Sequence Labeling

Safetensors

Supports Multiple Languages

Open Source License: BSD-3-Clause

#French Medical NER #Clinical Report Anonymization #Hybrid Rule-Based Model

Downloads 4,373

Release Time : 6/16/2024

Model Overview

This model combines rule-based and deep learning techniques to identify sensitive information in medical documents, such as addresses, dates, and social security numbers, so that the documents can be anonymized.

Model Features

High-Precision Anonymization

On the AP-HP test set, the model reaches token-level F1 scores of 97.6 for addresses and 98.7 for dates, with a micro-averaged F1 of 97.4.

Multi-Category Entity Recognition

Recognizes 13 entity labels covering identifying information such as addresses, social security numbers, and phone numbers; 12 of them are scored in the metrics table.

Hybrid Architecture

Combines rule-based and deep learning approaches to balance precision and recall (micro-averaged precision 97.0, recall 97.8).

Model Capabilities

Medical document sensitive information identification

Data anonymization processing

Multi-category entity tagging

Clinical report analysis

Use Cases

Medical Data Privacy Protection

Clinical Report Anonymization

Automatically identifies and anonymizes patient-sensitive information in reports.

Micro-averaged Redact score of 98.8 on the AP-HP test set

Medical Data Sharing

Research Data Preprocessing

Automatically removes identifiable information before sharing medical data.

Micro-averaged Redact Full score of 63.1 on the AP-HP test set, far lower than the Redact score of 98.8

🚀 EDS-Pseudo

This project detects identifying entities in documents and was primarily tested on clinical reports in AP-HP's Clinical Data Warehouse (EDS). It provides a hybrid model (rule-based + deep learning) together with its rules, a training recipe, and tools for generating synthetic datasets.

🚀 Quick Start

The model is built on top of edsnlp. It is a hybrid model (rule-based + deep learning) for which we provide the rules (`eds-pseudo/pipes`) and a training recipe (`train.py`).

We also provide some fictitious templates (`templates.txt`) and a script to generate a synthetic dataset (`generate_dataset.py`).

The entities that are detected are listed below.

| Label | Description |
|---|---|
| `ADRESSE` | Street address, e.g. `33 boulevard de Picpus` |
| `DATE` | Any absolute date other than a birthdate |
| `DATE_NAISSANCE` | Birthdate |
| `HOPITAL` | Hospital name, e.g. `Hôpital Rothschild` |
| `IPP` | Internal AP-HP identifier for patients, displayed as a number |
| `MAIL` | Email address |
| `NDA` | Internal AP-HP identifier for visits, displayed as a number |
| `NOM` | Any last name (patients, doctors, third parties) |
| `PRENOM` | Any first name (patients, doctors, etc.) |
| `SECU` | Social security number |
| `TEL` | Any phone number |
| `VILLE` | Any city |
| `ZIP` | Any zip code |
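
To make the label set concrete, here is a minimal sketch of how detected spans could be replaced by label placeholders. It is not part of eds-pseudo: the `redact` and `span_of` helpers, the sample sentence, and the hand-built `(start, end, label)` tuples are hypothetical stand-ins for the character offsets and labels a detector would produce.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def span_of(text: str, fragment: str, label: str) -> Span:
    """Hypothetical helper: locate the first occurrence of `fragment` in `text`."""
    start = text.index(fragment)
    return (start, start + len(fragment), label)


def redact(text: str, spans: Iterable[Span]) -> str:
    """Replace each span with a <LABEL> placeholder, left to right."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Span overlaps one that was already replaced: skip it
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Fictitious sentence and hand-labelled spans, for illustration only
text = "M. Dupont habite 33 boulevard de Picpus, 75012 Paris."
spans = [
    span_of(text, "Dupont", "NOM"),
    span_of(text, "33 boulevard de Picpus", "ADRESSE"),
    span_of(text, "75012", "ZIP"),
    span_of(text, "Paris", "VILLE"),
]
print(redact(text, spans))
```

Sorting the spans and tracking a cursor keeps the character offsets valid while the text is rebuilt, which would not hold if replacements were applied in place one after another.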

📦 Installation

Downloading the public pre-trained model

The public pre-trained model is available on the Hugging Face model hub at AP-HP/eds-pseudo-public and was trained on synthetic data (see `generate_dataset.py`). You can also test it directly on the demo.

1. Install the latest version of edsnlp:
```
pip install "edsnlp[ml]" -U
```
2. Get access to the model at AP-HP/eds-pseudo-public.
3. Create and copy a Hugging Face token with "READ" permission at https://huggingface.co/settings/tokens?new_token=true
4. Log in with that token:
```
import huggingface_hub

# YOUR_TOKEN is the token string copied in the previous step
huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)
```
5. Load the model:

```
import edsnlp

# Load the pipeline from the Hugging Face hub
nlp = edsnlp.load("AP-HP/eds-pseudo-public", auto_update=True)
doc = nlp(
    "En 2015, M. Charles-François-Bienvenu "
    "Myriel était évêque de Digne. C’était un vieillard "
    "d’environ soixante-quinze ans ; il occupait le "
    "siège de Digne depuis 2006."
)

# Print each detected entity, its label, and its `date` extension value
for ent in doc.ents:
    print(ent, ent.label_, str(ent._.date))
```

To apply the model on many documents using one or more GPUs, refer to the documentation of edsnlp.

Installation to reproduce

If you'd like to reproduce eds-pseudo's training or contribute to its development, first clone the repository:

```
git clone https://github.com/aphp/eds-pseudo.git
cd eds-pseudo
```

Then install the dependencies. We recommend pinning the library version in your projects, or using a strict package manager like Poetry.

📚 Documentation

Metrics

| AP-HP Pseudo Test Token Scores | Precision | Recall | F1 | Redact | Redact Full |
|---|---|---|---|---|---|
| ADRESSE | 98.2 | 96.9 | 97.6 | 97.6 | 96.7 |
| DATE | 99 | 98.4 | 98.7 | 98.8 | 85.9 |
| DATE_NAISSANCE | 97.5 | 96.9 | 97.2 | 99.3 | 99.4 |
| IPP | 91.9 | 90.8 | 91.3 | 98.5 | 99.3 |
| MAIL | 96.1 | 99.8 | 97.9 | 99.8 | 99.7 |
| NDA | 92.1 | 83.5 | 87.6 | 87.4 | 97.2 |
| NOM | 94.4 | 95.3 | 94.8 | 98.2 | 89.5 |
| PRENOM | 93.5 | 96.6 | 95 | 99 | 93.2 |
| SECU | 88.3 | 100 | 93.8 | 100 | 100 |
| TEL | 97.5 | 99.9 | 98.7 | 99.9 | 99.6 |
| VILLE | 96.7 | 93.8 | 95.2 | 95.1 | 91.1 |
| ZIP | 96.8 | 100 | 98.3 | 100 | 100 |
| micro | 97 | 97.8 | 97.4 | 98.8 | 63.1 |
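
As a reading aid for the table above, the sketch below recomputes F1 from the reported precision and recall. It assumes the F1 column is the standard harmonic mean of the two, which the source does not state explicitly; the `f1_score` helper is illustrative and not part of eds-pseudo.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the metrics table
rows = {
    "DATE": (99.0, 98.4, 98.7),
    "NDA": (92.1, 83.5, 87.6),
    "SECU": (88.3, 100.0, 93.8),
    "micro": (97.0, 97.8, 97.4),
}

for label, (p, r, reported) in rows.items():
    print(f"{label}: recomputed {f1_score(p, r):.1f}, reported {reported}")
```

For these four rows the recomputed value agrees with the reported F1 to one decimal. Because the table's precision and recall are themselves rounded, other rows can differ in the last digit: ADRESSE recomputes to about 97.5 against a reported 97.6.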

📄 License

This project is released under the BSD 3-Clause license.
