# medBERT.de: A Comprehensive German BERT Model for the Medical Domain
medBERT.de is a German medical natural language processing model based on the BERT architecture. It is pretrained on a large corpus of German medical texts, clinical notes, research papers, and healthcare-related documents, and is designed to perform a variety of NLP tasks in the medical domain, such as medical information extraction and diagnosis prediction.
## 🚀 Quick Start
medBERT.de is pretrained on a large German medical corpus and can be fine-tuned for downstream medical NLP tasks such as medical information extraction and diagnosis prediction. A minimal inference sketch is shown below.
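The following is a minimal sketch of masked-language-model inference with the Hugging Face `transformers` library. The Hub ID `GerMedBERT/medbert-512` is an assumption (not stated in the original README); adjust it if the repository is named differently.

```python
# Hedged sketch: masked-language-model inference via the transformers pipeline.
# The Hub ID "GerMedBERT/medbert-512" is an assumption, not taken from the README.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="GerMedBERT/medbert-512")

# Predict the masked word in a German medical sentence
for pred in fill_mask("Der Patient klagt über starke [MASK] im Brustbereich."):
    print(pred["token_str"], round(pred["score"], 3))
```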
## ✨ Features
- Medical-Specific Training: Pretrained on a large and diverse corpus of medical texts, clinical notes, research papers, and healthcare-related documents, enabling it to handle a wide range of medical NLP tasks.
- Standard BERT Architecture: Built on the standard BERT architecture, allowing it to capture rich contextual information from the input text.
- German-Language Focus: Specifically optimized for German medical language, making it well suited to German-speaking medical settings.
## 📦 Installation
The original README does not provide installation steps, so this section is skipped.
## 💻 Usage Examples
The original README does not provide code examples. The sketch below illustrates a typical fine-tuning workflow.
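A minimal fine-tuning sketch for a sequence-classification task, assuming the model is available on the Hugging Face Hub as `GerMedBERT/medbert-512`; this ID and the toy data are assumptions, not part of the original README.

```python
# Hedged sketch: fine-tuning medBERT.de for report classification with the
# Hugging Face Trainer API. The Hub ID "GerMedBERT/medbert-512" and the toy
# data are assumptions, not taken from the original README.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_id = "GerMedBERT/medbert-512"  # assumed Hub ID
texts = ["Kein Nachweis eines Pneumothorax.", "Ausgeprägte Pleuraergüsse beidseits."]
labels = [0, 1]  # toy binary labels (e.g. normal vs. pathological)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

class ReportDataset(Dataset):
    """Wraps tokenized reports and labels for the Trainer."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=512)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

training_args = TrainingArguments(
    output_dir="medbert-finetuned",   # where checkpoints are written
    num_train_epochs=1,
    per_device_train_batch_size=2,
)

trainer = Trainer(model=model, args=training_args, train_dataset=ReportDataset(texts, labels))
trainer.train()
```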
## 📚 Documentation
### Model Details
#### Architecture
medBERT.de is based on the standard BERT architecture, as described in the original BERT paper ("BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin et al.). The model employs a multi-layer bidirectional Transformer encoder, which captures contextual information from both the left-to-right and right-to-left directions of the input text. It has 12 layers, 768 hidden units per layer, 8 attention heads per layer, and can process up to 512 tokens in a single input sequence.
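For illustration, a minimal sketch of a `BertConfig` populated with the hyperparameters stated above; the values mirror this section's text, not the released checkpoint.

```python
# Hedged sketch: a BertConfig matching the hyperparameters described above.
# These values mirror the text of this section, not the released checkpoint.
from transformers import BertConfig, BertModel

config = BertConfig(
    num_hidden_layers=12,         # 12 Transformer layers
    hidden_size=768,              # 768 hidden units per layer
    num_attention_heads=8,        # 8 attention heads per layer
    max_position_embeddings=512,  # up to 512 tokens per input sequence
)

model = BertModel(config)  # randomly initialized model with this geometry
print(sum(p.numel() for p in model.parameters()), "parameters")
```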
#### Training Data
medBERT.de was pretrained on a large corpus of medical texts, clinical notes, research papers, and healthcare-related documents. The following table provides an overview of the data sources used for pretraining:
| Property | Details |
|---|---|
| Model Type | Based on the standard BERT architecture |
| Training Data | Pretrained on a large corpus including DocCheck Flexikon, GGPOnc 1.0, Webcrawl, PubMed abstracts, Radiology reports, Springer Nature, Electronic health records, Doctoral theses, Thieme Publishing Group, and Wikipedia. All data was anonymized and patient context was removed. |
| Source | No. Documents | No. Sentences | No. Words | Size (MB) |
|---|---|---|---|---|
| DocCheck Flexikon | 63,840 | 720,404 | 12,299,257 | 92 |
| GGPOnc 1.0 | 4,369 | 66,256 | 1,194,345 | 10 |
| Webcrawl | 11,322 | 635,806 | 9,323,774 | 65 |
| PubMed abstracts | 12,139 | 108,936 | 1,983,752 | 16 |
| Radiology reports | 3,657,801 | 60,839,123 | 520,717,615 | 4,195 |
| Springer Nature | 257,999 | 14,183,396 | 259,284,884 | 1,986 |
| Electronic health records | 373,421 | 4,603,461 | 69,639,020 | 440 |
| Doctoral theses | 7,486 | 4,665,850 | 90,380,880 | 648 |
| Thieme Publishing Group | 330,994 | 10,445,580 | 186,200,935 | 2,898 |
| Wikipedia | 3,639 | 161,714 | 2,799,787 | 22 |
| Summary | 4,723,010 | 96,430,526 | 1,153,824,249 | 10,372 |
#### Preprocessing
The input text is preprocessed with WordPiece tokenization, which breaks the text into subword units to better handle rare or out-of-vocabulary words. Casing is preserved and special characters are not removed from the text. medBERT.de ships with its own tokenizer, specifically optimized for German medical language.
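A small sketch of the cased WordPiece tokenization described above, again assuming the Hub ID `GerMedBERT/medbert-512`:

```python
# Hedged sketch: inspecting the cased WordPiece tokenizer on a German medical phrase.
# The Hub ID "GerMedBERT/medbert-512" is an assumption, not taken from the README.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GerMedBERT/medbert-512")

# A long compound medical term is split into subword units; casing is preserved
print(tokenizer.tokenize("Lungenarterienembolie beidseits ohne Infiltratnachweis"))
```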
### Performance Metrics
We fine-tuned medBERT.de on a variety of downstream tasks and compared it to other state-of-the-art BERT models in the German medical domain. Below are exemplary results for classification tasks on radiology reports; please refer to our paper for more detailed results.
| Model | AUROC | Macro F1 | Micro F1 | Precision | Recall |
|---|---|---|---|---|---|
| Chest CT | | | | | |
| GottBERT | 92.48 | 69.06 | 83.98 | 76.55 | 65.92 |
| BioGottBERT | 92.71 | 69.42 | 83.41 | 80.67 | 65.52 |
| Multilingual BERT | 91.90 | 66.31 | 80.86 | 68.37 | 65.82 |
| German-MedBERT | 92.48 | 66.40 | 81.41 | 72.77 | 62.37 |
| medBERT.de | 96.69 | 81.46 | 89.39 | 87.88 | 78.77 |
| medBERT.de<sub>dedup</sub> | 96.39 | 78.77 | 89.24 | 84.29 | 76.01 |
| Chest X-Ray | | | | | |
| GottBERT | 83.18 | 64.86 | 74.18 | 59.67 | 78.87 |
| BioGottBERT | 83.48 | 64.18 | 74.87 | 59.04 | 78.90 |
| Multilingual BERT | 82.43 | 63.23 | 73.92 | 56.67 | 75.33 |
| German-MedBERT | 83.22 | 63.13 | 75.39 | 55.66 | 78.03 |
| medBERT.de | 84.65 | 67.06 | 76.20 | 60.44 | 83.08 |
| medBERT.de<sub>dedup</sub> | 84.42 | 66.92 | 76.26 | 60.31 | 82.99 |
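For reference, a hedged sketch of how metrics like those reported above could be computed for a multi-label report-classification setup with scikit-learn; the labels and scores below are placeholders, not data from the paper.

```python
# Hedged sketch: computing AUROC, macro/micro F1, precision, and recall for a
# multi-label classification task. The arrays are toy placeholders only.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score

y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])                       # gold labels per finding
y_score = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.6], [0.6, 0.7, 0.2]])    # model probabilities
y_pred = (y_score >= 0.5).astype(int)                                      # thresholded predictions

print("AUROC    ", roc_auc_score(y_true, y_score, average="macro"))
print("Macro F1 ", f1_score(y_true, y_pred, average="macro"))
print("Micro F1 ", f1_score(y_true, y_pred, average="micro"))
print("Precision", precision_score(y_true, y_pred, average="macro"))
print("Recall   ", recall_score(y_true, y_pred, average="macro"))
```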
### Fairness and Bias
There are several potential biases in the training data for medBERT.de, which may impact the model's performance and fairness:
- Geographic Bias: As a significant portion of the clinical data comes from a single hospital in Berlin, Germany, the model may be biased towards the medical practices, terminology, and diseases prevalent in that region, leading to reduced performance and fairness in other regions or countries.
- Demographic Bias: The patient population at the Berlin hospital may not represent the broader German or global population. Differences in age, gender, ethnicity, and socioeconomic status can cause biases in the model's predictions and understanding of certain medical conditions.
- Specialty Bias: A large part of the training data consists of radiology reports, which may bias the model towards radiology-related language and concepts, resulting in a less accurate understanding of other medical specialties.
### Security and Privacy
#### Anonymization
All clinical data used for training the model has been thoroughly anonymized, with patient names and other personally identifiable information (PII) removed to protect patient privacy. Although some data sources may contain names of non-patient individuals, these instances are unrelated to patient data and should not pose a significant privacy risk. However, it is possible to extract these names from the model. All training data is stored securely and will not be publicly accessible, but some training data for medical benchmarks will be made available.
#### Model Security
medBERT.de has been designed with security considerations in mind to minimize the risks associated with adversarial attacks and information leakage. We tested the model for information leakage and found no evidence of data leakage. However, as with any machine learning model, complete security against potential attacks cannot be guaranteed.
### Limitations
- Generalization: medBERT.de might struggle with medical terms or concepts not in the training dataset, especially new or rare diseases, treatments, and procedures.
- Language Bias: medBERT.de is primarily trained on German-language data, and its performance may degrade significantly for non-German languages or multilingual contexts.
- Misinterpretation of Context: medBERT.de may occasionally misinterpret the context of the text, leading to incorrect predictions or extracted information.
- Inability to Verify Information: medBERT.de is not capable of verifying the accuracy of the information it processes, making it unsuitable for tasks where data validation is critical.
- Legal and Ethical Considerations: The model must not be used to make, or contribute to, medical decisions; it is intended for research purposes only.
### Terms of Use
By downloading and using the MedBERT model from the Hugging Face Hub, you agree to abide by the following terms and conditions:
- Purpose and Scope: The MedBERT model is intended for research and informational purposes only and must not be used as the sole basis for making medical decisions or diagnosing patients. The model should be used as a supplementary tool alongside professional medical advice and clinical judgment.
- Proper Usage: Users agree to use MedBERT in a responsible manner, complying with all applicable laws, regulations, and ethical guidelines. The model must not be used for any unlawful, harmful, or malicious purposes and must not be used in clinical decision-making or patient treatment.
- Data Privacy and Security: Users are responsible for ensuring the privacy and security of any sensitive or confidential data processed using the MedBERT model. Personally identifiable information (PII) should be anonymized before being processed by the model, and users must implement appropriate measures to protect data privacy.
- Prohibited Activities: Users are strictly prohibited from attempting to perform adversarial attacks, information retrieval, or any other actions that may compromise the security and integrity of the MedBERT model. Violators may face legal consequences and the retraction of the model's publication.
### Legal Disclaimer
By using medBERT.de, you agree not to engage in any attempts to perform adversarial attacks or information retrieval from the model. Such activities are strictly prohibited and constitute a violation of the terms of use. Violators may face legal consequences, and any discovered violations may result in the immediate retraction of the model's publication. By continuing to use medBERT.de, you acknowledge and accept the responsibility to adhere to these terms and conditions.
### Citation
```bibtex
@article{medbertde,
  title={MEDBERT.de: A Comprehensive German BERT Model for the Medical Domain},
  author={Keno K. Bressem and Jens-Michalis Papaioannou and Paul Grundmann and Florian Borchert and Lisa C. Adams and Leonhard Liu and Felix Busch and Lina Xu and Jan P. Loyen and Stefan M. Niehues and Moritz Augustin and Lennart Grosser and Marcus R. Makowski and Hugo JWL. Aerts and Alexander Löser},
  journal={arXiv preprint arXiv:2303.08179},
  year={2023},
  url={https://doi.org/10.48550/arXiv.2303.08179},
  note={Keno K. Bressem and Jens-Michalis Papaioannou and Paul Grundmann contributed equally},
  subject={Computation and Language (cs.CL); Artificial Intelligence (cs.AI)},
}
```
## 🔧 Technical Details
Detailed technical information (architecture, training data, preprocessing, and evaluation) is covered in the 📚 Documentation section above.
## 📄 License
The model is released under the Apache-2.0 license.

