đ BioClinical ModernBERT
BioClinical ModernBERT offers two sizes: base (150M parameters) and large (396M parameters). You can find the model training checkpoints here, and our code is available in our GitHub repository.
đ Quick Start
BioClinical ModernBERT is a domain-adapted encoder based on ModernBERT. It can be used directly with the `transformers` library starting from v4.48.0. You can install the required library with the following command:

```bash
pip install -U "transformers>=4.48.0"
```
⨠Features
- Two Sizes: Available in base (150M parameters) and large (396M parameters) versions.
- Long-Context Processing: Supports an 8,192-token context length, which is beneficial for long biomedical and clinical documents.
- Trained on Large Corpus: Trained on over 53.5 billion tokens from the largest biomedical and clinical corpus to date.
- Diverse Data Sources: Leverages 20 datasets from diverse institutions, domains, and geographic regions.
đĻ Installation
You can install the necessary `transformers` library with the following command:

```bash
pip install -U "transformers>=4.48.0"
```
đģ Usage Examples
Basic Usage
Since BioClinical ModernBERT is a Masked Language Model (MLM), you can use the `fill-mask` pipeline or load it via `AutoModelForMaskedLM`.

Using `AutoModelForMaskedLM`:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "thomas-sounack/BioClinical-ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "Mitochondria is the powerhouse of the [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Locate the [MASK] position and take the highest-scoring token at that position.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
```
Using a pipeline:
```python
import torch
from transformers import pipeline
from pprint import pprint

pipe = pipeline(
    "fill-mask",
    model="thomas-sounack/BioClinical-ModernBERT-base",
    torch_dtype=torch.bfloat16,
)

input_text = "[MASK] is a disease caused by an uncontrolled division of abnormal cells in a part of the body."
results = pipe(input_text)
pprint(results)
```
Advanced Usage
To use BioClinical ModernBERT for downstream tasks such as classification, retrieval, or question answering, fine-tune it following standard BERT fine-tuning recipes, as in the sketch below.
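For instance, here is a minimal fine-tuning sketch for sequence classification with the Hugging Face `Trainer`. The dataset (`imdb`), label count, and hyperparameters are placeholders for illustration only, not an official recipe; swap in your own labeled clinical data with `text` and `label` columns.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "thomas-sounack/BioClinical-ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=2 is a placeholder; set it to the number of classes in your task.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Placeholder dataset: any dataset with "text" and "label" columns works the same way.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

# Illustrative hyperparameters, not tuned values from the paper.
args = TrainingArguments(
    output_dir="bioclinical-modernbert-finetuned",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
)
trainer.train()
```

The same pattern applies to other heads (for example token classification) by swapping the `AutoModel` class and the dataset columns.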
â ī¸ Important Note
If your GPU supports it, we recommend using BioClinical ModernBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:
```bash
pip install flash-attn
```
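As an illustration, one way to explicitly request Flash Attention 2 when loading the model is via the `attn_implementation` argument (this assumes a supported GPU and a working `flash-attn` installation):

```python
import torch
from transformers import AutoModelForMaskedLM

# Assumes a Flash Attention 2 compatible GPU and that flash-attn is installed.
model = AutoModelForMaskedLM.from_pretrained(
    "thomas-sounack/BioClinical-ModernBERT-base",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
```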
đĄ Usage Tip
Like ModernBERT, BioClinical ModernBERT does not use token type IDs, unlike some earlier BERT models. Most downstream usage is identical to standard BERT models on the Hugging Face Hub, except that you can omit the `token_type_ids` parameter.
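For example, here is a quick, illustrative check of what the tokenizer returns for a sentence pair (the example sentences are arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thomas-sounack/BioClinical-ModernBERT-base")

# Encode a sentence pair and inspect the returned fields. A token_type_ids entry,
# if present, can simply be left out of the model inputs.
enc = tokenizer("No acute distress.", "Patient denies chest pain.", return_tensors="pt")
print(list(enc.keys()))
```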
đ Documentation
Model Summary
BioClinical ModernBERT builds on ModernBERT base and large. It brings long-context processing and significant improvements in speed and performance for biomedical and clinical NLP. By using 20 diverse clinical datasets, it addresses a key limitation of prior clinical encoders.
Training
Data
BioClinical ModernBERT is trained on 50.7B tokens of biomedical text from PubMed and PMC, and 2.8B tokens of clinical text from 20 datasets. The details are shown in the following tables:

| Property | Details |
|---|---|
| Model Type | BioClinical ModernBERT (base and large) |
| Training Data | 50.7B tokens from PubMed and PMC, 2.8B tokens from 20 clinical datasets |
| Name | Country | Clinical Source | Clinical Context | Samples | Tokens (M) |
|---|---|---|---|---|---|
| ACI-BENCH | US | Clinical Notes | Not Reported | 207 | 0.1 |
| ADE Corpus | Several | Clinical Notes | Not Reported | 20,896 | 0.5 |
| Brain MRI Stroke | Korea | Radiology Reports | Neurology | 2,603 | 0.2 |
| CheXpert Plus | US | Radiology Reports | Pulmonology | 223,460 | 60.6 |
| CHIFIR | Australia | Pathology Reports | Hematology / Oncology | 283 | 0.1 |
| CORAL | US | Progress Notes | Hematology / Oncology | 240 | 0.7 |
| Eye Gaze CXR | US | Radiology Reports | Pulmonology | 892 | 0.03 |
| Gout Chief Complaints | US | Chief Complaint | Internal Medicine | 8,429 | 0.2 |
| ID-68 | UK | Clinical Notes | Psychology | 78 | 0.02 |
| Inspect | US | Radiology Reports | Pulmonology | 22,259 | 2.8 |
| MedNLI | US | Clinical Notes | Internal Medicine | 14,047 | 0.5 |
| MedQA | US | National Medical Board Examination | Not Reported | 14,366 | 2.0 |
| MIMIC-III | US | Clinical Notes | Internal Medicine | 2,021,411 | 1,047.7 |
| MIMIC-IV Note | US | Clinical Notes | Internal Medicine | 2,631,243 | 1,765.7 |
| MTSamples | Not Reported | Clinical Notes | Internal Medicine | 2,358 | 1.7 |
| Negex | US | Discharge Summaries | Not Reported | 2,056 | 0.1 |
| PriMock57 | UK | Simulated Patient Care | Internal Medicine | 57 | 0.01 |
| Q-Pain | US | Clinical Vignettes | Palliative Care | 51 | 0.01 |
| REFLACX | US | Radiology Reports | Pulmonology | 2,543 | 0.1 |
| Simulated Resp. Interviews | Canada | Simulated Patient Care | Pulmonology | 272 | 0.6 |
Methodology
BioClinical ModernBERT base is trained in two phases. It is initialized from the last stable-phase checkpoint of ModernBERT base and trained with a learning rate of 3e-4 and a batch size of 72.
- Phase 1: Train on 160.5B tokens from PubMed, PMC, and the 20 clinical datasets. The learning rate remains constant, and the masking probability is set to 30% (see the masking sketch after this list).
- Phase 2: Train only on the 20 clinical datasets. The masking probability is reduced to 15%. The model is trained for 3 epochs with a 1-sqrt learning rate decay.
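The snippet below is an illustrative sketch only, not the original training code: it simply shows how the two masking probabilities described above map onto the standard `transformers` MLM data collator.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("thomas-sounack/BioClinical-ModernBERT-base")

# Phase 1 masks 30% of tokens; Phase 2 lowers the masking probability to 15%.
phase_1_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.30)
phase_2_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Example batch of one tokenized sentence (placeholder text).
batch = phase_1_collator([tokenizer("Patient presents with shortness of breath.")])
print(batch["input_ids"].shape, batch["labels"].shape)
```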
Evaluation
The following table shows the evaluation results of BioClinical ModernBERT compared with other models:
| | Model | Context Length | ChemProt | Phenotype | COS | Social History | DEID |
|---|---|---|---|---|---|---|---|
| Base | BioBERT | 512 | 89.5 | 26.6 | 94.9 | 55.8 | 74.3 |
| | Clinical BERT | 512 | 88.3 | 25.8 | 95.0 | 55.2 | 74.2 |
| | BioMed-RoBERTa | 512 | 89.0 | 36.8 | 94.9 | 55.2 | 81.1 |
| | Clinical-BigBird | 4096 | 87.4 | 26.5 | 94.0 | 53.3 | 71.2 |
| | Clinical-Longformer | 4096 | 74.2 | 46.4 | 95.2 | 56.8 | 82.3 |
| | Clinical ModernBERT | 8192 | 86.9 | 54.9 | 93.7 | 53.8 | 44.4 |
| | ModernBERT-base | 8192 | 89.5 | 48.4 | 94.0 | 53.1 | 78.3 |
| | BioClinical ModernBERT-base | 8192 | 89.9 | 58.1 | 95.1 | 58.5 | 82.7 |
| Large | ModernBERT-large | 8192 | 90.2 | 58.3 | 94.4 | 54.8 | 82.1 |
| | BioClinical ModernBERT-large | 8192 | 90.8 | 60.8 | 95.1 | 57.1 | 83.8 |
đ License
We release the BioClinical ModernBERT base and large model weights and training checkpoints under the MIT license.
đ Citation
If you use BioClinical ModernBERT in your work, please cite our preprint:
```bibtex
@misc{sounack2025bioclinicalmodernbertstateoftheartlongcontext,
  title={BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP},
  author={Thomas Sounack and Joshua Davis and Brigitte Durieux and Antoine Chaffin and Tom J. Pollard and Eric Lehman and Alistair E. W. Johnson and Matthew McDermott and Tristan Naumann and Charlotta Lindvall},
  year={2025},
  eprint={2506.10896},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.10896},
}
```