🚀 EDS-Pseudo
This project aims to detect identifying entities in documents, primarily tested on clinical reports in AP-HP's Clinical Data Warehouse (EDS). It provides a hybrid model (rule-based + deep learning) with rules and a training recipe, as well as tools for generating synthetic datasets.
🚀 Quick Start
This project aims to detect identifying entities in documents, and was primarily tested on clinical reports at AP-HP's Clinical Data Warehouse (EDS).
The model is built on top of edsnlp and is a hybrid (rule-based + deep learning) model, for which we provide the rules (eds-pseudo/pipes) and a training recipe (train.py).
We also provide some fictitious templates (templates.txt) and a script to generate a synthetic dataset (generate_dataset.py).
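To make the idea of template-based synthetic data concrete, here is a minimal sketch in plain Python. It is not the logic of generate_dataset.py: the `{LABEL}` placeholder syntax, the `FAKE_VALUES` pools and the `fill_template` helper are all assumptions made for illustration, and the real templates.txt may use a different format.

```python
import random
import re

# Hypothetical template syntax: placeholders are written as {LABEL}.
TEMPLATE = "Patient : {PRENOM} {NOM}, date de naissance {DATE_NAISSANCE}, ville {VILLE}."

# Hypothetical pools of fictitious values, one pool per label.
FAKE_VALUES = {
    "PRENOM": ["Jean", "Marie"],
    "NOM": ["Dupont", "Martin"],
    "DATE_NAISSANCE": ["12/03/1954", "05/11/1980"],
    "VILLE": ["Paris", "Lyon"],
}

PLACEHOLDER = re.compile(r"\{([A-Z_]+)\}")


def fill_template(template, values, rng):
    """Return (text, entities), where entities are (start, end, label) spans."""
    parts = []
    entities = []
    cursor = 0  # current position in the template
    length = 0  # length of the text generated so far
    for match in PLACEHOLDER.finditer(template):
        # Copy the literal text located before the placeholder.
        literal = template[cursor:match.start()]
        parts.append(literal)
        length += len(literal)
        # Draw a fictitious value and record where it lands in the output.
        label = match.group(1)
        value = rng.choice(values[label])
        entities.append((length, length + len(value), label))
        parts.append(value)
        length += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), entities


text, entities = fill_template(TEMPLATE, FAKE_VALUES, random.Random(0))
print(text)
for start, end, label in entities:
    print(label, repr(text[start:end]))
```

Because the character offsets are recorded while the text is built, each generated document comes with exact annotations at no extra labelling cost.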
The entities that are detected are listed below.
| Label | Description |
|---|---|
| ADRESSE | Street address, e.g. 33 boulevard de Picpus |
| DATE | Any absolute date other than a birthdate |
| DATE_NAISSANCE | Birthdate |
| HOPITAL | Hospital name, e.g. Hôpital Rothschild |
| IPP | Internal AP-HP identifier for patients, displayed as a number |
| MAIL | Email address |
| NDA | Internal AP-HP identifier for visits, displayed as a number |
| NOM | Any last name (patients, doctors, third parties) |
| PRENOM | Any first name (patients, doctors, etc.) |
| SECU | Social security number |
| TEL | Any phone number |
| VILLE | Any city |
| ZIP | Any zip code |
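As a rough illustration of what the rule-based half of a hybrid detector can look like, the sketch below matches two of the labels above with regular expressions. These patterns are assumptions written for this example; they are not the rules shipped in eds-pseudo/pipes, and `find_entities` is a hypothetical helper.

```python
import re

# Illustrative patterns only, not the project's actual rules.
PATTERNS = {
    "MAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ZIP": re.compile(r"\b\d{5}\b"),  # naive: any standalone 5-digit number
}


def find_entities(text):
    """Return sorted (start, end, label) spans for every pattern match."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


sample = "Contact : jean.dupont@example.org, 75012 Paris"
for start, end, label in find_entities(sample):
    print(label, repr(sample[start:end]))
```

Patterns like these work for rigidly formatted labels, but the naive ZIP rule would also fire on any other five-digit number, and names or cities have no fixed shape at all, which is one reason to pair rules with a learned component.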
📦 Installation
Downloading the public pre-trained model
The public pre-trained model is available on the Hugging Face model hub at AP-HP/eds-pseudo-public and was trained on synthetic data (see generate_dataset.py). You can also test it directly on the demo.
- Install the latest version of edsnlp

      pip install "edsnlp[ml]" -U

- Get access to the model at AP-HP/eds-pseudo-public

- Create and copy a Hugging Face token with "READ" permission at https://huggingface.co/settings/tokens?new_token=true

- Register the token (only once) on your machine

      import huggingface_hub

      # Replace YOUR_TOKEN with the token string created in the previous step
      huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)

- Load the model

      import edsnlp

      nlp = edsnlp.load("AP-HP/eds-pseudo-public", auto_update=True)
      doc = nlp(
          "En 2015, M. Charles-François-Bienvenu "
          "Myriel était évêque de Digne. C’était un vieillard "
          "d’environ soixante-quinze ans ; il occupait le "
          "siège de Digne depuis 2006."
      )

      # Print each detected entity, its label and its parsed date (if any)
      for ent in doc.ents:
          print(ent, ent.label_, str(ent._.date))
To apply the model to many documents using one or more GPUs, refer to the edsnlp documentation.
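Once entities have been detected, a typical next step is to mask them in the text. The sketch below shows one simple way to do that from character spans. The `redact` function and the `<LABEL>` placeholder format are assumptions for illustration, not part of eds-pseudo or edsnlp. With a document produced by the model, the spans could presumably be built from `ent.start_char`, `ent.end_char` and `ent.label_` for each entity in `doc.ents`.

```python
def redact(text, spans):
    """Replace each (start, end, label) span of `text` with a <LABEL> placeholder."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # This span overlaps one that was already masked: skip it.
            continue
        out.append(text[cursor:start])
        out.append(f"<{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "Dr Myriel, tel 01 23 45 67 89, Paris"
values = [("Myriel", "NOM"), ("01 23 45 67 89", "TEL"), ("Paris", "VILLE")]
# Build character spans by locating each value in the text.
spans = [(text.index(v), text.index(v) + len(v), label) for v, label in values]
print(redact(text, spans))
```

Masking from left to right with a moving cursor keeps the offsets valid even though each replacement changes the length of the text.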
Installation to reproduce
If you'd like to reproduce eds-pseudo's training or contribute to its development, first clone the repository:

    git clone https://github.com/aphp/eds-pseudo.git
    cd eds-pseudo

Then install the dependencies. We recommend pinning the library version in your projects or using a strict package manager such as Poetry.
📚 Documentation
Metrics
| AP-HP Pseudo Test Token Scores | Precision | Recall | F1 | Redact | Redact Full |
|---|---|---|---|---|---|
| ADRESSE | 98.2 | 96.9 | 97.6 | 97.6 | 96.7 |
| DATE | 99 | 98.4 | 98.7 | 98.8 | 85.9 |
| DATE_NAISSANCE | 97.5 | 96.9 | 97.2 | 99.3 | 99.4 |
| IPP | 91.9 | 90.8 | 91.3 | 98.5 | 99.3 |
| MAIL | 96.1 | 99.8 | 97.9 | 99.8 | 99.7 |
| NDA | 92.1 | 83.5 | 87.6 | 87.4 | 97.2 |
| NOM | 94.4 | 95.3 | 94.8 | 98.2 | 89.5 |
| PRENOM | 93.5 | 96.6 | 95 | 99 | 93.2 |
| SECU | 88.3 | 100 | 93.8 | 100 | 100 |
| TEL | 97.5 | 99.9 | 98.7 | 99.9 | 99.6 |
| VILLE | 96.7 | 93.8 | 95.2 | 95.1 | 91.1 |
| ZIP | 96.8 | 100 | 98.3 | 100 | 100 |
| micro | 97 | 97.8 | 97.4 | 98.8 | 63.1 |
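For readers unfamiliar with token-level scoring, the sketch below computes precision, recall and F1 using their standard definitions over sets of (token index, label) pairs. The project's exact evaluation protocol is not described in this section, so treat this as an assumption about how such scores are usually obtained. The `harmonic_f1` and `token_scores` helpers and the toy data are hypothetical, and the Redact columns are not covered.

```python
def harmonic_f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def token_scores(gold, pred):
    """Precision, recall and F1 (in %) over sets of (token_index, label) pairs."""
    true_positives = len(gold & pred)
    precision = 100 * true_positives / len(pred) if pred else 0.0
    recall = 100 * true_positives / len(gold) if gold else 0.0
    return precision, recall, harmonic_f1(precision, recall)


# Toy annotations: the token at index 1 is mislabelled, the one at index 9 is missed.
gold = {(0, "NOM"), (1, "PRENOM"), (5, "DATE"), (9, "VILLE")}
pred = {(0, "NOM"), (1, "NOM"), (5, "DATE")}

precision, recall, f1 = token_scores(gold, pred)
print(f"P={precision:.1f} R={recall:.1f} F1={f1:.1f}")
```

As a sanity check against the table, `harmonic_f1(99, 98.4)` rounds to 98.7, the F1 reported for DATE.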
📄 License
This project is released under the BSD 3-Clause license.