ClinicalBERT - Bio + Discharge Summary BERT Model
ClinicalBERT is a specialized BERT model initialized from BioBERT and trained on discharge summaries from MIMIC. It provides contextual embeddings for clinical NLP tasks such as named entity recognition (NER) and natural language inference (NLI).
Quick Start
Load the model via the transformers library:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_Discharge_Summary_BERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_Discharge_Summary_BERT")
Features
The Publicly Available Clinical BERT Embeddings paper describes four clinicalBERT models. This particular model is initialized from BioBERT and trained only on discharge summaries from MIMIC.
Installation
Install the transformers library (for example, pip install transformers); no other installation is required beyond the model-loading code shown in the Quick Start section.
Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_Discharge_Summary_BERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_Discharge_Summary_BERT")
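Once loaded, the model can produce contextual embeddings. The following is a minimal sketch, assuming PyTorch is installed; the example sentence and the mean-pooling step are illustrative additions, not part of the original model card.

import torch

# Example clinical sentence (illustrative only)
text = "The patient was discharged home in stable condition on aspirin 81 mg daily."

# Tokenize and run a forward pass without gradient tracking
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)

# Last hidden state: one 768-dimensional vector per input token
token_embeddings = outputs.last_hidden_state  # shape: (1, sequence_length, 768)

# A simple sentence-level embedding via mean pooling over tokens
sentence_embedding = token_embeddings.mean(dim=1)  # shape: (1, 768)
print(sentence_embedding.shape)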
Documentation
Pretraining Data
The Bio_Discharge_Summary_BERT model was trained on all discharge summaries from MIMIC-III, a database containing electronic health records from ICU patients at the Beth Israel Hospital in Boston, MA. For more details on MIMIC, see the MIMIC-III documentation on PhysioNet. All notes from the NOTEEVENTS table were included (~880M words).
Model Pretraining
Note Preprocessing
Each note in MIMIC was first split into sections using a rules-based section splitter (e.g. discharge summary notes were split into "History of Present Illness", "Family History", "Brief Hospital Course", etc. sections). Each section was then split into sentences using SciSpacy (the en_core_sci_md tokenizer).
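As an illustration of the SciSpacy step only (the rules-based section splitter is not shown), here is a minimal sketch of sentence splitting, assuming scispacy and the en_core_sci_md model are installed; the sample section text is invented.

import spacy

# Load the SciSpacy model used for sentence splitting (assumes en_core_sci_md is installed)
nlp = spacy.load("en_core_sci_md")

# Invented example of one note section, for illustration only
section_text = (
    "Brief Hospital Course: The patient was admitted for chest pain. "
    "Serial troponins were negative. She was discharged home in stable condition."
)

# Split the section text into sentences
sentences = [sent.text for sent in nlp(section_text).sents]
print(sentences)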
Pretraining Procedures
The model was trained using code from Google's BERT repository on a GeForce GTX TITAN X 12 GB GPU. Model parameters were initialized with BioBERT (BioBERT-Base v1.0 + PubMed 200K + PMC 270K).
Pretraining Hyperparameters
We used a batch size of 32, a maximum sequence length of 128, and a learning rate of 5 · 10⁻⁵ for pre-training our models. The models trained on all MIMIC notes were trained for 150,000 steps. The dup factor for duplicating input data with different masks was set to 5. All other default parameters were used (specifically, masked language model probability = 0.15 and max predictions per sequence = 20).
More Information
Refer to the original paper, Publicly Available Clinical BERT Embeddings (NAACL Clinical NLP Workshop 2019), for additional details and performance on NLI and NER tasks.
Questions?
Post a GitHub issue on the clinicalBERT repo or email emilya@mit.edu with any questions.
License
This project is licensed under the MIT license.