Continued Pre-training of MedRoBERTa.nl
This project performs continued, off-premise pre-training of MedRoBERTa.nl on approximately 50GB of open Dutch and translated English corpora, followed by on-premise pre-training on a combination of 5GB of Electronic Health Records and 2GB of the public data.
Quick Start
Pre-training proceeds in two phases: first, off-premise pre-training on about 50GB of open Dutch and translated English corpora; then, on-premise pre-training on a mix of 5GB of Electronic Health Records and 2GB of the public data.
Features
- Continued pre-training of the medical language model MedRoBERTa.nl.
- A large collection of Dutch and English corpora from diverse medical sources.
- A combination of off-premise and on-premise pre-training strategies.
Installation
No specific installation steps are provided; the model is intended for use with the Hugging Face transformers library (see the metadata below).
Usage Examples
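The original card contains no usage snippet. Below is a minimal sketch, assuming the checkpoint is loaded through transformers; the repo id of the continued pre-trained model is not given, so the base model id is used as a placeholder.

```python
from transformers import pipeline

# The repo id of the continued pre-trained checkpoint is not stated in the card;
# the base model id is used as a stand-in here. Replace it with the actual checkpoint.
model_id = "CLTL/MedRoBERTa.nl"

# The model is trained with a masked-language-modelling objective,
# so fill-mask is the most direct way to probe it.
fill_mask = pipeline("fill-mask", model=model_id)

# Build a Dutch clinical sentence around the tokenizer's own mask token.
mask = fill_mask.tokenizer.mask_token
sentence = f"De patiënt werd opgenomen met {mask} op de spoedeisende hulp."

for prediction in fill_mask(sentence):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Since this is an encoder-only model, typical downstream use would be fine-tuning for classification or token-level tasks rather than fill-mask alone.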
Documentation
Data statistics
Sources:
- Dutch: medical guidelines (FMS, NHG)
- Dutch: NtvG papers
- Dutch: Cardiovascular Electronic Health Records
- English: PubMed abstracts
- English: PMC abstracts translated using DeepL
- English: Apollo guidelines, papers and books
- English: Meditron guidelines
- English: MIMIC-III
- English: MIMIC-CXR
- English: MIMIC-IV
All other translated data (i.e. not translated with DeepL) was produced using a combination of Gemini Flash 1.5/2.0, GPT-4o mini, MarianMT, and NLLB-200.
- Number of tokens: 20B
- Number of documents: 32M
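The English sources listed above were machine-translated to Dutch with the tools named above. As a hedged illustration only (the exact NLLB-200 variant and settings are not stated in the card, so the checkpoint name below is an assumption), one of those open models can be driven through the transformers translation pipeline:

```python
from transformers import pipeline

# Assumed checkpoint: the distilled 600M NLLB-200 model; the card does not say
# which NLLB-200 variant (or which settings) was actually used.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",  # English source (FLORES-200 code)
    tgt_lang="nld_Latn",  # Dutch target (FLORES-200 code)
)

abstract = "The patient was admitted with acute chest pain and shortness of breath."
print(translator(abstract, max_length=128)[0]["translation_text"])
```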
Training
- Effective batch size: 5120
- Learning rate: 2e-4
- Weight decay: 1e-3
- Learning schedule: linear, with 5,000 warmup steps
- Number of epochs: ~3 (off-premise) followed by 3 (on-premise)
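A minimal sketch of how these hyperparameters map onto a transformers masked-LM training run. The per-device batch size / gradient-accumulation split, the masking probability (0.15, the library default), and the toy dataset are assumptions; only the effective batch size and the other values listed above come from the card.

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("CLTL/MedRoBERTa.nl")
model = AutoModelForMaskedLM.from_pretrained("CLTL/MedRoBERTa.nl")

# Toy stand-in corpus; the real run used ~32M documents / ~20B tokens.
texts = ["De patiënt heeft last van kortademigheid.", "Voorbeeld van een klinische notitie."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Effective batch size 5120 = per_device_train_batch_size * gradient_accumulation_steps
# * number of devices; the 64 x 80 split on a single device is an assumption.
args = TrainingArguments(
    output_dir="medroberta-nl-continued",
    per_device_train_batch_size=64,
    gradient_accumulation_steps=80,
    learning_rate=2e-4,
    weight_decay=1e-3,
    lr_scheduler_type="linear",
    warmup_steps=5_000,
    num_train_epochs=3,
)

# Standard RoBERTa-style masked-language-modelling collator.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset)
trainer.train()
```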
Perplexity:
- Train perplexity: 2.4
- Validation perplexity: 3.3
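Perplexity here is the exponential of the mean masked-LM cross-entropy loss; as a small worked check (the loss value below is back-derived from the reported number, not taken from the card):

```python
import math

# exp(loss) gives perplexity; ln(3.3) ≈ 1.19, so a validation loss of about
# 1.19 corresponds to the reported validation perplexity of 3.3.
eval_loss = 1.19  # e.g. the "eval_loss" returned by Trainer.evaluate()
print(f"perplexity ≈ {math.exp(eval_loss):.2f}")  # ≈ 3.29
```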
Acknowledgement
This work was done in collaboration with the Amsterdam UMC, within the context of the DataTools4Heart project.
The project team was fortunate to be able to use the Google TPU Research Cloud for model training.
Technical Details
The project performs continued pre-training of the MedRoBERTa.nl model on a large collection of Dutch and English medical corpora from diverse sources, with English material machine-translated to Dutch using several translation tools. Training uses the hyperparameters listed above (effective batch size, learning rate, weight decay, and a linear schedule with warm-up steps) and runs in two phases, off-premise followed by on-premise; train and validation perplexity are used to evaluate performance.
License
The project is licensed under the GPL-3.0 license.
| Property | Details |
|---|---|
| Model Type | Continued pre-trained MedRoBERTa.nl |
| Training Data | Dutch and English medical corpora from multiple sources, including medical guidelines, papers, and Electronic Health Records |
| License | GPL-3.0 |
| Library Name | transformers |
| Metrics | Perplexity |
| Base Model | CLTL/MedRoBERTa.nl |
| Tags | medical, healthcare |