Open-source MS-BERT Model - Designed Specifically for Medical Text Analysis to Aid in the Interpretation of Multiple Sclerosis Records

MS-BERT

Developed by NLP4H

MS-BERT is a BERT model pretrained on neurological examination records of multiple sclerosis (MS) patients, specifically designed for medical text analysis.

Large Language Model #Multiple Sclerosis Analysis #Clinical Record Pretraining #Specialized for Neurological Examinations


Release Time: 3/2/2022

Model Overview

This model is based on the BLUE-BERT-base architecture and was further pretrained on approximately 75,000 clinical records, focusing on text comprehension tasks related to neurological examinations.

Model Features

Medical Domain Specialization

Domain-adapted training based on real neurological examination records of multiple sclerosis patients

Strict Data Privacy Protection

All clinical records underwent systematic de-identification, complying with medical data privacy standards

Gender Ratio Representation

Training data gender distribution (72% female/28% male) aligns with natural MS incidence rates

Model Capabilities

Clinical Text Understanding

Medical Entity Recognition

Neurological Examination Record Analysis

Use Cases

Clinical Research Support

MS Disease Course Analysis

Extracting disease progression features from neurological examination records

Medical Document Processing

De-identified Text Analysis

Processing anonymized clinical records for natural language understanding

🚀 MS-BERT

This repository offers the code and models for MS-BERT, a model pre-trained on neurological examination notes of Multiple Sclerosis (MS) patients.

🚀 Quick Start

MS-BERT was pre-trained on neurological examination notes for Multiple Sclerosis (MS) patients at St. Michael's Hospital in Toronto, Canada.
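The text above does not say how to load the model, so the following is only a minimal sketch. It assumes the checkpoint is published on the Hugging Face Hub under the ID `NLP4H/ms_bert` (an assumption, not stated in this document) and that `transformers` and `torch` are installed. The helper names `load_ms_bert` and `embed_note` are hypothetical and chosen for illustration.

```python
# Assumption: the checkpoint lives on the Hugging Face Hub under this ID.
# The ID is not given in the text above; adjust it to the real location.
MODEL_ID = "NLP4H/ms_bert"


def load_ms_bert(model_id=MODEL_ID):
    """Load the tokenizer and encoder (requires `transformers` and `torch`)."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # disable dropout for inference
    return tokenizer, model


def embed_note(text, tokenizer, model, max_length=512):
    """Return the final-layer [CLS] vector for a single de-identified note."""
    import torch

    inputs = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=max_length
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # Position 0 of the last hidden state corresponds to the [CLS] token.
    return outputs.last_hidden_state[0, 0]


# Example (downloads the weights on first use):
#   tokenizer, model = load_ms_bert()
#   vector = embed_note("Patient reports numbness in the left hand.", tokenizer, model)
#   print(vector.shape)
```

The resulting vector could serve as a feature for a downstream classifier, but any such use remains subject to the disclaimer in this document.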

✨ Features

Pre-trained on a large dataset of clinical notes from MS patients, which contains rich and clinically relevant neurological information.
Based on the pre-trained and fine-tuned BLUE-BERT base, further pre-trained using the masked language modelling task.

📚 Documentation

📦 Data

The dataset contained approximately 75,000 clinical notes for about 5,000 patients, totaling over 35.7 million words. These notes were collected from patients who visited the St. Michael's Hospital MS Clinic between 2015 and 2019. The notes contained a variety of information pertaining to a neurological exam. For example, a note can contain information on the patient's condition, their progress over time, and their diagnosis. The gender split within the dataset was observed to be 72% female and 28% male (which reflects the natural discrepancy seen in MS).

📦 Data pre-processing

The data was pre-processed to remove any identifying information. This includes patient names, doctor names, hospital names, patient identification numbers, phone numbers, addresses, and times. To de-identify the notes, we used a curated database that contained patient and doctor information. This curated database was paired with regular expressions to find and remove any identifying pieces of information. Each of these identifiers was replaced with a specific token. These tokens were chosen based on three criteria: (1) they belong to the current BERT vocab, (2) they have roughly the same semantic meaning as the word they are replacing, and (3) they are not found in the original unprocessed dataset. The replacements that met the criteria above were as follows:

Identifier type	Replacement token
Female first names	Lucie
Male first names	Ezekiel
Last/family names	Salamanca
Dates	2010s
Patient IDs	999
Phone numbers	1718
Addresses	Silesia
Time	1610
Locations/Hospital/Clinic names	Troy
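The replacement scheme above can be sketched as follows. This is an illustration only, not the project's actual pipeline: the regular expressions, the `MRN` prefix for patient IDs, and the function names are hypothetical. The name sets passed to `deidentify` stand in for the curated database, and addresses and location names are left out for brevity. Only the replacement tokens come from the table.

```python
import re

# Replacement tokens copied from the table above.
TOKENS = {
    "female_first": "Lucie",
    "male_first": "Ezekiel",
    "last": "Salamanca",
    "date": "2010s",
    "patient_id": "999",
    "phone": "1718",
    "time": "1610",
}

# Hypothetical patterns for illustration; real notes need broader coverage.
PATIENT_ID_RE = re.compile(r"\b(MRN[: ]*)\d+\b")
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}\b")


def _replace_names(text, names, token):
    """Replace whole-word occurrences of any curated name with one token."""
    if not names:
        return text  # an empty alternation would match at every word boundary
    # Longest names first so a longer name wins over a shorter prefix.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.sub(rf"\b(?:{alternatives})\b", token, text)


def deidentify(text, female_names=(), male_names=(), last_names=()):
    """Apply pattern-based replacements first, then curated-name replacements."""
    text = PATIENT_ID_RE.sub(lambda m: m.group(1) + TOKENS["patient_id"], text)
    text = PHONE_RE.sub(TOKENS["phone"], text)
    text = DATE_RE.sub(TOKENS["date"], text)
    text = TIME_RE.sub(TOKENS["time"], text)
    text = _replace_names(text, female_names, TOKENS["female_first"])
    text = _replace_names(text, male_names, TOKENS["male_first"])
    text = _replace_names(text, last_names, TOKENS["last"])
    return text
```

Because every replacement is an existing vocabulary token, the tokenizer never sees an unknown placeholder, which is consistent with criterion (1) above.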

📦 Pre-training

The starting point for our model is the already pre-trained and fine-tuned BLUE-BERT base. We further pre-train it using the masked language modelling task from the Hugging Face Transformers library.
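To make the masked language modelling objective concrete, here is a library-free sketch of the standard BERT masking rule: about 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. The rates are the usual BERT defaults and are assumed here; the settings actually used for MS-BERT may differ. The function `mask_tokens` is a hypothetical helper, not part of any library.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for one sequence under BERT-style masking.

    labels[i] holds the original token where a prediction is required and
    None elsewhere, so the loss is computed only on selected positions.
    """
    rng = rng or random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Never select special tokens, and select others with mask_prob.
        if token in SPECIAL_TOKENS or rng.random() >= mask_prob:
            continue
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return inputs, labels
```

During pre-training the model receives `inputs` and is trained to recover the original token at each position where `labels` is set, which is how it adapts to the vocabulary and phrasing of the clinical notes.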

The hyperparameters can be found in the config file in this repository.

📄 Acknowledgements

We would like to thank the researchers and staff at the Data Science and Advanced Analytics (DSAA) department, St. Michael’s Hospital, for providing consistent support and guidance throughout this project. We would also like to thank Dr. Marzyeh Ghassemi, Taylor Killan, Nathan Ng and Haoran Zhang for giving us the opportunity to work on this exciting project.

📄 Disclaimer

⚠️ Important Note

MS-BERT shows the results of research conducted at the Data Science and Advanced Analytics (DSAA) department, St. Michael’s Hospital. The results produced by MS-BERT are not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not make decisions about their health solely on the basis of the results produced by MS-BERT. St. Michael’s Hospital does not independently verify the validity or utility of the results produced by MS-BERT. If you have questions about the results produced by MS-BERT, please consult a healthcare professional. If you would like more information about the research conducted at DSAA, please contact Zhen Yang. If you would like more information on neurological examination notes, please contact Dr. Tony Antoniou or Dr. Jiwon Oh from the MS clinic at St. Michael's Hospital.
