Medbert 512
medBERT.de: A Comprehensive German BERT Model for the Medical Domain
medBERT.de is a German medical natural language processing model based on the BERT architecture. It was trained on a large corpus of medical texts, clinical notes, research papers, and healthcare-related documents, and is designed for NLP tasks in the medical domain such as medical information extraction and diagnosis prediction.
Features
- Based on the standard BERT architecture, capturing rich bidirectional contextual information.
- Trained on a diverse medical dataset covering multiple medical subdomains.
- Ships with a tokenizer optimized for German medical language.
Installation
medBERT.de is distributed through the Hugging Face Hub; no installation steps beyond the usual model-loading libraries are given in the original model card.
Usage Examples
The original model card does not include code examples.
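As an illustrative sketch (not from the original model card), the model can presumably be loaded with the Hugging Face `transformers` library for masked-token prediction. The Hub id `GerMedBERT/medbert-512` used below is an assumption and should be checked against the actual repository name:

```python
# Hedged sketch: masked-token prediction with medBERT.de via the
# Hugging Face `transformers` fill-mask pipeline. The Hub id below
# is an assumption; verify it before use.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="GerMedBERT/medbert-512")

# German clinical-style sentence with one masked token.
masked = f"Der Patient zeigt Anzeichen einer {fill_mask.tokenizer.mask_token}."
results = fill_mask(masked)

for r in results[:3]:
    print(r["token_str"], round(r["score"], 3))
```

Each returned entry contains the predicted token and its score; the top candidates should be plausible German medical terms.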
Documentation
Model Details
Architecture
medBERT.de is based on the standard BERT architecture described in the original BERT paper ("BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin et al.). It uses a multi-layer bidirectional Transformer encoder that captures contextual information from both the left-to-right and right-to-left directions of the input text. The model has 12 layers, 768 hidden units per layer, and 8 attention heads per layer, and can process up to 512 tokens in a single input sequence.
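To make these dimensions concrete, here is a rough back-of-the-envelope parameter estimate derived from the numbers above; the vocabulary size of 30,000 is an assumption for illustration only:

```python
# Rough parameter estimate from the stated dimensions:
# 12 layers, 768 hidden units, 8 attention heads, 512-token inputs.
# The vocabulary size (~30,000) is an assumption for illustration.
hidden, layers, heads, max_len, vocab = 768, 12, 8, 512, 30_000

head_dim = hidden // heads  # dimensions per attention head

# One encoder layer: Q/K/V/output projections, two feed-forward
# matrices (expansion factor 4), plus biases and two LayerNorms.
attn = 4 * (hidden * hidden + hidden)
ffn = (hidden * 4 * hidden + 4 * hidden) + (4 * hidden * hidden + hidden)
norms = 2 * 2 * hidden
per_layer = attn + ffn + norms

# Embeddings: token, position, and segment tables plus one LayerNorm.
embeddings = (vocab + max_len + 2) * hidden + 2 * hidden

total = layers * per_layer + embeddings
print(head_dim, per_layer, total)  # 96 dims/head, ~7.1M per layer
```

Under these assumptions the model lands in the ~110M-parameter range typical of BERT-base-sized encoders.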
Training Data
| Property | Details |
|---|---|
| Model Type | German medical natural language processing model based on the BERT architecture |
| Training Data | A large corpus of medical texts, clinical notes, research papers, and healthcare-related documents. Sources include DocCheck Flexikon, GGPOnc 1.0, a web crawl, PubMed abstracts, radiology reports, Springer Nature, electronic health records, doctoral theses, the Thieme Publishing Group, and Wikipedia. All training data was anonymized and patient context was removed. |
The following table provides an overview of the data sources used for pretraining medBERT.de:
| Source | No. Documents | No. Sentences | No. Words | Size (MB) |
|---|---|---|---|---|
| DocCheck Flexikon | 63,840 | 720,404 | 12,299,257 | 92 |
| GGPOnc 1.0 | 4,369 | 66,256 | 1,194,345 | 10 |
| Webcrawl | 11,322 | 635,806 | 9,323,774 | 65 |
| PubMed abstracts | 12,139 | 108,936 | 1,983,752 | 16 |
| Radiology reports | 3,657,801 | 60,839,123 | 520,717,615 | 4,195 |
| Springer Nature | 257,999 | 14,183,396 | 259,284,884 | 1,986 |
| Electronic health records | 373,421 | 4,603,461 | 69,639,020 | 440 |
| Doctoral theses | 7,486 | 4,665,850 | 90,380,880 | 648 |
| Thieme Publishing Group | 330,994 | 10,445,580 | 186,200,935 | 2,898 |
| Wikipedia | 3,639 | 161,714 | 2,799,787 | 22 |
| Summary | 4,723,010 | 96,430,526 | 1,153,824,249 | 10,372 |
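To put the table in perspective, the sketch below computes each source's share of the pretraining corpus by word count (values copied from the table); radiology reports alone account for roughly 45% of all words, which is relevant to the specialty bias discussed under Fairness and Bias:

```python
# Word counts per source, copied from the pretraining-data table above.
words = {
    "Radiology reports": 520_717_615,
    "Springer Nature": 259_284_884,
    "Thieme Publishing Group": 186_200_935,
    "Doctoral theses": 90_380_880,
    "Electronic health records": 69_639_020,
    "DocCheck Flexikon": 12_299_257,
    "Webcrawl": 9_323_774,
    "Wikipedia": 2_799_787,
    "PubMed abstracts": 1_983_752,
    "GGPOnc 1.0": 1_194_345,
}

total = sum(words.values())                      # matches the Summary row
shares = {k: v / total for k, v in words.items()}
print(f"{shares['Radiology reports']:.1%}")     # radiology dominates
```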
Preprocessing
The input text is preprocessed using the WordPiece tokenization technique, which breaks the text into subword units to better capture rare or out-of-vocabulary words. Casing is preserved, and special characters are not removed from the text. medBERT.de comes with its own tokenizer, specifically optimized for German medical language.
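For illustration, WordPiece tokenization can be sketched as a greedy longest-match over a subword vocabulary. The toy vocabulary below is invented; the real medBERT.de tokenizer ships with its own learned German medical vocabulary:

```python
# Minimal sketch of WordPiece-style greedy longest-match tokenization.
def wordpiece(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub   # continuation pieces carry a prefix
            if sub in vocab:
                piece = sub        # longest matching subword found
                break
            end -= 1
        if piece is None:
            return [unk]           # no subword matches at this position
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary, invented for illustration (case is preserved).
toy_vocab = {"Lungen", "##embolie", "##entzündung", "Herz"}
print(wordpiece("Lungenembolie", toy_vocab))  # ['Lungen', '##embolie']
```

A rare compound like "Lungenembolie" is split into known subwords rather than mapped to a single unknown token.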
Performance Metrics
We fine-tuned medBERT.de on a variety of downstream tasks and compared it to other state-of-the-art BERT models in the German medical domain. The tables below show exemplary results for classification tasks on radiology reports; please refer to our paper for more detailed results.
Chest CT

| Model | AUROC | Macro F1 | Micro F1 | Precision | Recall |
|---|---|---|---|---|---|
| GottBERT | 92.48 | 69.06 | 83.98 | 76.55 | 65.92 |
| BioGottBERT | 92.71 | 69.42 | 83.41 | 80.67 | 65.52 |
| Multilingual BERT | 91.90 | 66.31 | 80.86 | 68.37 | 65.82 |
| German-MedBERT | 92.48 | 66.40 | 81.41 | 72.77 | 62.37 |
| medBERT.de | 96.69 | 81.46 | 89.39 | 87.88 | 78.77 |
| medBERT.de (dedup) | 96.39 | 78.77 | 89.24 | 84.29 | 76.01 |

Chest X-ray

| Model | AUROC | Macro F1 | Micro F1 | Precision | Recall |
|---|---|---|---|---|---|
| GottBERT | 83.18 | 64.86 | 74.18 | 59.67 | 78.87 |
| BioGottBERT | 83.48 | 64.18 | 74.87 | 59.04 | 78.90 |
| Multilingual BERT | 82.43 | 63.23 | 73.92 | 56.67 | 75.33 |
| German-MedBERT | 83.22 | 63.13 | 75.39 | 55.66 | 78.03 |
| medBERT.de | 84.65 | 67.06 | 76.20 | 60.44 | 83.08 |
| medBERT.de (dedup) | 84.42 | 66.92 | 76.26 | 60.31 | 82.99 |
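As a reminder of how these metrics relate, F1 is the harmonic mean of precision and recall; micro scores pool all label decisions before computing the metric, while macro scores average the per-label metric (so the averaged precision/recall columns do not directly reproduce the micro F1 column). A toy sketch with invented numbers:

```python
# F1 as the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Invented per-label (precision, recall) pairs for a toy two-label task.
labels = [(0.90, 0.60), (0.50, 0.80)]

# Macro F1: average the per-label F1 scores.
macro_f1 = sum(f1(p, r) for p, r in labels) / len(labels)

print(round(f1(0.90, 0.60), 3))  # 0.72
print(round(macro_f1, 3))
```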
Fairness and Bias
- Geographic Bias: As a significant portion of the clinical data comes from a single hospital in Berlin, Germany, the model may be biased towards the medical practices, terminology, and diseases prevalent in that region. This can lead to reduced performance and fairness when applied to other regions or countries with different healthcare systems and patient populations.
- Demographic Bias: The patient population at the Berlin hospital may not be representative of the broader German or global population. Differences in age, gender, ethnicity, and socioeconomic status can cause biases in the model's predictions and understanding of certain medical conditions, symptoms, or treatments.
- Specialty Bias: A large part of the training data consists of radiology reports, which could bias the model towards the language and concepts used in radiology. This may result in a less accurate understanding of other medical specialties or subdomains underrepresented in the training data.
Security and Privacy
Anonymization
All clinical data used for training the model has been thoroughly anonymized, with patient names and other personally identifiable information (PII) removed to protect patient privacy. Some data sources, such as DocCheck, may contain names of well-known physicians or individuals whose talks were recorded on the DocCheck platform; these instances are unrelated to patient data and should not pose a significant privacy risk, although it is possible to extract such names from the model. All training data is stored securely and will not be made publicly accessible, apart from some of the data used for the medical benchmarks.
Model Security
MedBERT has been designed with security considerations in mind to minimize risks associated with adversarial attacks and information leakage. We tested the model for information leakage, and no evidence of data leakage has been found. However, as with any machine learning model, it is impossible to guarantee complete security against potential attacks.
Limitations
- Generalization: medBERT.de might struggle with medical terms or concepts not in the training dataset, especially new or rare diseases, treatments, and procedures.
- Language Bias: medBERT.de is trained primarily on German-language data, and its performance may degrade significantly for non-German or multilingual text.
- Misinterpretation of Context: medBERT.de may occasionally misinterpret the context of the text, leading to incorrect predictions or extracted information.
- Inability to Verify Information: medBERT.de is not capable of verifying the accuracy of the information it processes, making it unsuitable for tasks where data validation is critical.
- Legal and Ethical Considerations: The model must not be used to make, or contribute to, medical decisions; it is intended for research use only.
Terms of Use
By downloading and using the MedBERT model from the Hugging Face Hub, you agree to abide by the following terms and conditions:
- Purpose and Scope: The MedBERT model is intended for research and informational purposes only and must not be used as the sole basis for making medical decisions or diagnosing patients. The model should be used as a supplementary tool alongside professional medical advice and clinical judgment.
- Proper Usage: Users agree to use MedBERT in a responsible manner, complying with all applicable laws, regulations, and ethical guidelines. The model must not be used for any unlawful, harmful, or malicious purposes, nor in clinical decision-making or patient treatment.
- Data Privacy and Security: Users are responsible for ensuring the privacy and security of any sensitive or confidential data processed using the MedBERT model. Personally identifiable information (PII) should be anonymized before being processed by the model, and users must implement appropriate measures to protect data privacy.
- Prohibited Activities: Users are strictly prohibited from attempting to perform adversarial attacks, information retrieval, or any other actions that may compromise the security and integrity of the MedBERT model. Violators may face legal consequences and the retraction of the model's publication.
Legal Disclaimer
By using medBERT.de, you agree not to engage in any attempts to perform adversarial attacks or information retrieval from the model. Such activities are strictly prohibited and constitute a violation of the terms of use. Violators may face legal consequences, and any discovered violations may result in the immediate retraction of the model's publication. By continuing to use medBERT.de, you acknowledge and accept the responsibility to adhere to these terms and conditions.
Citation
@article{medbertde,
  title={MEDBERT.de: A Comprehensive German BERT Model for the Medical Domain},
  author={Keno K. Bressem and Jens-Michalis Papaioannou and Paul Grundmann and Florian Borchert and Lisa C. Adams and Leonhard Liu and Felix Busch and Lina Xu and Jan P. Loyen and Stefan M. Niehues and Moritz Augustin and Lennart Grosser and Marcus R. Makowski and Hugo JWL. Aerts and Alexander Löser},
  journal={arXiv preprint arXiv:2303.08179},
  year={2023},
  url={https://doi.org/10.48550/arXiv.2303.08179},
  note={Keno K. Bressem, Jens-Michalis Papaioannou and Paul Grundmann contributed equally},
  subject={Computation and Language (cs.CL); Artificial Intelligence (cs.AI)},
}

