🚀 ModernBERT
ModernBERT is a modernized bidirectional encoder-only Transformer model pre-trained on a large amount of English and code data, suitable for long-context tasks and a wide range of downstream applications.
🚀 Quick Start
You can use these models directly with the `transformers` library. Until the next `transformers` release, doing so requires installing `transformers` from main:

```bash
pip install git+https://github.com/huggingface/transformers.git
```
Since ModernBERT is a Masked Language Model (MLM), you can use the `fill-mask` pipeline or load it via `AutoModelForMaskedLM`. To use ModernBERT for downstream tasks like classification, retrieval, or QA, fine-tune it following standard BERT fine-tuning recipes.
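To make the fine-tuning path concrete, here is a minimal sketch of a classification fine-tune with the Hugging Face `Trainer`. The dataset (`imdb`), label count, and training arguments are illustrative placeholders, not a recommended recipe.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Placeholder dataset; swap in your own task.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="modernbert-classifier", num_train_epochs=1),
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```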
✨ Features
- ModernBERT is a modernized bidirectional encoder-only Transformer model (BERT-style) pre-trained on 2 trillion tokens of English and code data with a native context length of up to 8,192 tokens.
- It leverages recent architectural improvements such as Rotary Positional Embeddings (RoPE) for long-context support, Local-Global Alternating Attention for efficiency on long inputs, and Unpadding and Flash Attention for efficient inference.
- Available in two sizes: [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) with 22 layers and 149 million parameters, and [ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large) with 28 layers and 395 million parameters (see the configuration check after this list).
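If you want to verify these figures for yourself, a quick sketch that reads them from each checkpoint's configuration (attribute names assumed to follow the standard `transformers` ModernBERT config):

```python
from transformers import AutoConfig

# Print the depth and native context length of each ModernBERT checkpoint.
for model_id in ("answerdotai/ModernBERT-base", "answerdotai/ModernBERT-large"):
    config = AutoConfig.from_pretrained(model_id)
    print(f"{model_id}: {config.num_hidden_layers} layers, "
          f"context length {config.max_position_embeddings}")
```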
📦 Installation
To use ModernBERT, you need to install the `transformers` library from the main branch:

```bash
pip install git+https://github.com/huggingface/transformers.git
```
If your GPU supports it, we recommend running ModernBERT with Flash Attention 2 for the highest efficiency. To do so, install Flash Attention:

```bash
pip install flash-attn
```
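Once `flash-attn` is installed, a minimal sketch of loading the model with Flash Attention 2 enabled (this assumes a supported CUDA GPU):

```python
import torch
from transformers import AutoModelForMaskedLM

# Load ModernBERT in bfloat16 with the Flash Attention 2 kernel.
model = AutoModelForMaskedLM.from_pretrained(
    "answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
```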
💻 Usage Examples
Basic Usage
Using `AutoModelForMaskedLM`:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "DeepMount00/ModernBERT-base-ita"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "La capitale dell'Italia è [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Locate the [MASK] position and decode the highest-scoring prediction.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
```
Advanced Usage
Using a pipeline:
```python
import torch
from transformers import pipeline
from pprint import pprint

# Build a fill-mask pipeline running the model in bfloat16.
pipe = pipeline(
    "fill-mask",
    model="answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,
)

input_text = "He walked to the [MASK]."
results = pipe(input_text)
pprint(results)
```
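The pipeline returns a list of candidate fills for the masked position, each with a confidence score, the predicted token, and the completed sentence.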
📚 Documentation
For more information about ModernBERT, we recommend our release blog post for a high-level overview, and our arXiv preprint for in-depth information.
🔧 Technical Details
We evaluate ModernBERT across a range of tasks, including natural language understanding (GLUE), general retrieval (BEIR), long-context retrieval (MLDR), and code retrieval (CodeSearchNet and StackQA).
- On GLUE, ModernBERT-base surpasses other similarly sized encoder models, and ModernBERT-large is second only to DeBERTa-v3-large.
- For general retrieval tasks, ModernBERT performs well on BEIR in both single-vector (DPR-style) and multi-vector (ColBERT-style) settings (a single-vector sketch follows this list).
- Thanks to the inclusion of code data in its training mixture, ModernBERT as a backbone also achieves new state-of-the-art code retrieval results on CodeSearchNet and StackQA.
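To illustrate the single-vector (DPR-style) setting, here is a minimal sketch that mean-pools ModernBERT's last hidden state into one embedding per text and compares a query and a passage by cosine similarity. The pooling and similarity choices are illustrative assumptions, not the exact setup used in the evaluation above.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

def embed(texts):
    # Mean-pool the last hidden state over non-padding tokens.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

query, passage = embed(["What is the capital of France?", "Paris is the capital of France."])
print("cosine similarity:", F.cosine_similarity(query, passage, dim=0).item())
```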
📄 License
We release the ModernBERT model architectures, model weights, and training codebase under the Apache 2.0 license.
📚 Citation
```bibtex
@misc{modernbert,
      title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
      author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
      year={2024},
      eprint={2412.13663},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13663},
}
```
⚠️ Important Note
ModernBERT does not use token type IDs, unlike some earlier BERT models. Most downstream usage is identical to standard BERT models on the Hugging Face Hub, except you can omit the `token_type_ids` parameter.
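For illustration, a minimal sketch of what this looks like in practice (assuming the tokenizer's encoding contains only `input_ids` and `attention_mask`, as described above):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

inputs = tokenizer("He walked to the [MASK].", return_tensors="pt")
print(inputs.keys())       # no token_type_ids in the encoding
outputs = model(**inputs)  # the model is called without token_type_ids
```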
⚠️ Important Note
ModernBERT’s training data is primarily English and code, so performance may be lower for other languages. While it can handle long sequences efficiently, using the full 8,192-token window may be slower than short-context inference. As with any large model, ModernBERT may produce representations that reflect biases present in its training data. Verify critical or sensitive outputs before relying on them.
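For long-context use, a small sketch of encoding a document up to the full 8,192-token window (the placeholder text and truncation settings are illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

long_text = "word " * 10000  # placeholder long document
inputs = tokenizer(long_text, truncation=True, max_length=8192, return_tensors="pt")
print(inputs["input_ids"].shape)  # sequence length capped at 8,192 tokens

with torch.no_grad():
    outputs = model(**inputs)
```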
| Property | Details |
|----------|---------|
| Library Name | transformers |
| Model Type | ModernBERT (a modernized bidirectional encoder-only Transformer model) |
| Training Data | 2 trillion tokens of English and code data |
| License | Apache 2.0 |
| Tags | fill-mask, masked-lm, long-context, modernbert |
| Pipeline Tag | fill-mask |
| Languages Supported | English, Italian |