Continued Pre-training of MedRoBERTa.nl
This project performs continued, off-premise pre-training of MedRoBERTa.nl on approximately 50GB of open Dutch and translated English corpora, followed by on-premise pre-training on a combination of 5GB of Electronic Health Records and 2GB of the public data.
Quick Start
Pre-training proceeds in two phases: first, off-premise pre-training on about 50GB of open Dutch and translated English corpora; then, on-premise pre-training on a mix of 5GB of Electronic Health Records and 2GB of the public data.
Features
- Continued pre-training of the medical language model MedRoBERTa.nl.
- A large collection of Dutch and English corpora from diverse medical sources.
- A combination of off-premise and on-premise pre-training strategies.
Installation
No specific installation steps are provided; the model is intended for use with the Hugging Face transformers library (see the metadata below).
Usage Examples
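The original card contains no usage snippet. Below is a minimal sketch, assuming the checkpoint is loaded through transformers; the repo id of the continued pre-trained model is not given, so the base model id is used as a placeholder.

```python
from transformers import pipeline

# The repo id of the continued pre-trained checkpoint is not stated in the card;
# the base model id is used as a stand-in here. Replace it with the actual checkpoint.
model_id = "CLTL/MedRoBERTa.nl"

# The model is trained with a masked-language-modelling objective,
# so fill-mask is the most direct way to probe it.
fill_mask = pipeline("fill-mask", model=model_id)

# Build a Dutch clinical sentence around the tokenizer's own mask token.
mask = fill_mask.tokenizer.mask_token
sentence = f"De patiënt werd opgenomen met {mask} op de spoedeisende hulp."

for prediction in fill_mask(sentence):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Since this is an encoder-only model, typical downstream use would be fine-tuning for classification or token-level tasks rather than fill-mask alone.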
Documentation
Data statistics
Sources:
- Dutch: medical guidelines (FMS, NHG)
- Dutch: NtvG papers
- Dutch: Cardiovascular Electronic Health Records
- English: PubMed abstracts
- English: PMC abstracts translated using DeepL
- English: Apollo guidelines, papers and books
- English: Meditron guidelines
- English: MIMIC-III
- English: MIMIC-CXR
- English: MIMIC-IV
All other translated data (i.e. not translated with DeepL) was produced using a combination of Gemini Flash 1.5/2.0, GPT-4o mini, MarianMT, and NLLB-200.
- Number of tokens: 20B
- Number of documents: 32M
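The English sources listed above were machine-translated to Dutch with the tools named above. As a hedged illustration only (the exact NLLB-200 variant and settings are not stated in the card, so the checkpoint name below is an assumption), one of those open models can be driven through the transformers translation pipeline:

```python
from transformers import pipeline

# Assumed checkpoint: the distilled 600M NLLB-200 model; the card does not say
# which NLLB-200 variant (or which settings) was actually used.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",  # English source (FLORES-200 code)
    tgt_lang="nld_Latn",  # Dutch target (FLORES-200 code)
)

abstract = "The patient was admitted with acute chest pain and shortness of breath."
print(translator(abstract, max_length=128)[0]["translation_text"])
```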
Training
- Effective batch size: 5120
- Learning rate: 2e-4
- Weight decay: 1e-3
- Learning schedule: linear, with 5,000 warmup steps
- Number of epochs: ~3 (off-premise) followed by 3 (on-premise)
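A minimal sketch of how these hyperparameters map onto a transformers masked-LM training run. The per-device batch size / gradient-accumulation split, the masking probability (0.15, the library default), and the toy dataset are assumptions; only the effective batch size and the other values listed above come from the card.

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("CLTL/MedRoBERTa.nl")
model = AutoModelForMaskedLM.from_pretrained("CLTL/MedRoBERTa.nl")

# Toy stand-in corpus; the real run used ~32M documents / ~20B tokens.
texts = ["De patiënt heeft last van kortademigheid.", "Voorbeeld van een klinische notitie."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Effective batch size 5120 = per_device_train_batch_size * gradient_accumulation_steps
# * number of devices; the 64 x 80 split on a single device is an assumption.
args = TrainingArguments(
    output_dir="medroberta-nl-continued",
    per_device_train_batch_size=64,
    gradient_accumulation_steps=80,
    learning_rate=2e-4,
    weight_decay=1e-3,
    lr_scheduler_type="linear",
    warmup_steps=5_000,
    num_train_epochs=3,
)

# Standard RoBERTa-style masked-language-modelling collator.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset)
trainer.train()
```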
Perplexity:
- Train perplexity: 2.4
- Validation perplexity: 3.3
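Perplexity here is the exponential of the mean masked-LM cross-entropy loss; as a small worked check (the loss value below is back-derived from the reported number, not taken from the card):

```python
import math

# exp(loss) gives perplexity; ln(3.3) ≈ 1.19, so a validation loss of about
# 1.19 corresponds to the reported validation perplexity of 3.3.
eval_loss = 1.19  # e.g. the "eval_loss" returned by Trainer.evaluate()
print(f"perplexity ≈ {math.exp(eval_loss):.2f}")  # ≈ 3.29
```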
Acknowledgement
This work was done in collaboration with the Amsterdam UMC, within the context of the DataTools4Heart project.
The project team was fortunate to be able to use the Google TPU Research Cloud for model training.
Technical Details
The project performs continued pre-training of the MedRoBERTa.nl model on a large collection of Dutch and English medical corpora from diverse sources, with English material machine-translated to Dutch using several translation tools. Training uses the hyperparameters listed above (effective batch size, learning rate, weight decay, and a linear schedule with warm-up steps) and runs in two phases, off-premise followed by on-premise; train and validation perplexity are used to evaluate performance.
License
The project is licensed under the GPL-3.0 license.
| Property | Details |
|---|---|
| Model Type | Continued pre-trained MedRoBERTa.nl |
| Training Data | Dutch and English medical corpora from multiple sources, including medical guidelines, papers, and Electronic Health Records |
| License | GPL-3.0 |
| Library Name | transformers |
| Metrics | Perplexity |
| Base Model | CLTL/MedRoBERTa.nl |
| Tags | medical, healthcare |