🚀 ModernCamemBERT
ModernCamemBERT is a French language model pretrained on a large corpus of 1T tokens of high-quality French text. It is the French counterpart of [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) and aims to offer efficient performance for tasks requiring long context lengths or fast inference.
🚀 Quick Start
ModernCamemBERT was trained with the Masked Language Modeling (MLM) objective at a 30% mask rate on 1T tokens, using 48 H100 GPUs. The training dataset combines French [RedPajama-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) (filtered via heuristic and semantic filtering), French scientific documents from HALvest, and the French Wikipedia. Semantic filtering was performed by fine-tuning a BERT classifier on a document-quality dataset automatically labeled by Llama-3 70B.
We reused the existing [CamemBERTav2](https://huggingface.co/almanach/camembertav2-base) tokenizer. The model was initially trained with a 1024-token context length, which was increased to 8192 tokens later in pretraining. More details about the training process can be found in the ModernCamemBERT paper.
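The snippet below is a minimal sketch, not the actual training pipeline (which uses the ModernBERT codebase), of how a 30% MLM masking setup can be reproduced with the Hugging Face `transformers` library, assuming the released checkpoint ships the reused tokenizer:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Load the tokenizer shipped with the checkpoint (reused from CamemBERTaV2)
tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-base")

# ModernCamemBERT uses a 30% mask rate instead of the usual 15%
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.3
)

# Toy batch: roughly 30% of the non-special tokens get masked, and `labels`
# holds the original ids at the masked positions (-100 elsewhere)
batch = collator([tokenizer("Un exemple de texte français pour le pré-entraînement.")])
print(batch["input_ids"])
print(batch["labels"])
```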
The goal of ModernCamemBERT was a controlled study: pretraining ModernBERT on the same dataset as CamemBERTaV2, a French DeBERTaV3 model, to isolate the effect of model design. The results show that the previous-generation DeBERTaV3 model (CamemBERTaV2) remains ahead in sample efficiency and overall benchmark performance, with ModernBERT's main advantage being faster training and inference. ModernCamemBERT nevertheless brings meaningful architectural improvements over the older BERT- and RoBERTa-based CamemBERT/CamemBERTv2 models. Additionally, high-quality pretraining data accelerates convergence but does not significantly improve final performance, suggesting possible benchmark saturation.
✨ Features
- Large-scale Pretraining: Trained on 1T tokens of high-quality French text.
- Masked Language Modeling: Trained with a 30% mask rate using the MLM objective.
- Context Length Increase: Context length increased from 1024 to 8192 tokens during pretraining (see the config check below).
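As a quick sanity check, a sketch assuming the checkpoint exposes the standard ModernBERT configuration fields, the long-context setting can be read directly from the published config:

```python
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("almanach/moderncamembert-base")
tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-base")

# Expected to report the 8192-token context of the final pretraining stage
print(config.max_position_embeddings)
print(tokenizer.model_max_length)
```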
📦 Installation
ModernCamemBERT is distributed through the Hugging Face Hub and requires only the `transformers` library (`pip install -U transformers`); a recent release that includes the ModernBERT architecture is assumed.
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

model = AutoModel.from_pretrained("almanach/moderncamembert-base")
tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-base")
```
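Since the checkpoint was trained with MLM, it can also be used directly for masked-token prediction. The following is an illustrative sketch (the example sentence and the number of candidates shown are arbitrary; predictions depend on the checkpoint):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-base")
model = AutoModelForMaskedLM.from_pretrained("almanach/moderncamembert-base")

# Use the tokenizer's own mask token rather than hard-coding it
text = f"Le camembert est un fromage {tokenizer.mask_token} de Normandie."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 candidates for the masked position
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```

The same checkpoint can also be used through the higher-level `fill-mask` pipeline.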
📚 Documentation
Fine-tuning Results
Datasets used for fine-tuning include NER (FTB), the FLUE benchmark (XNLI, CLS, PAWS-X), and the French Question Answering Dataset (FQuAD).
| Model | FTB-NER | CLS | PAWS-X | XNLI | F1 (FQuAD) | EM (FQuAD) |
|---|---|---|---|---|---|---|
| CamemBERT | 89.97 | 94.62 | 91.36 | 81.95 | 80.98 | 62.51 |
| CamemBERTa | 90.33 | 94.92 | 91.67 | 82.00 | 81.15 | 62.01 |
| CamemBERTv2 | 81.99 | 95.07 | 92.00 | 81.75 | 80.98 | 61.35 |
| CamemBERTav2 | 93.40 | 95.63 | 93.06 | 84.82 | 83.04 | 64.29 |
| ModernCamemBERT-CV2 | 92.17 | 94.86 | 92.71 | 82.85 | 81.68 | 62.00 |
| ModernCamemBERT | 91.33 | 94.92 | 92.52 | 83.62 | 82.19 | 62.66 |
Finetuned models are available in the following collection: [ModernCamembert Models](https://huggingface.co/collections/almanach/moderncamembert-67f7e6d85ede5f7cfc1ce012)
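For reference, the sketch below shows how task-specific heads for this kind of fine-tuning can be instantiated on top of the base checkpoint with `transformers`. The label counts are illustrative placeholders and this is not the exact training recipe behind the reported numbers:

```python
from transformers import AutoModelForSequenceClassification, AutoModelForTokenClassification

# Sentence-pair / text classification head, e.g. for XNLI-style NLI (3 labels)
nli_model = AutoModelForSequenceClassification.from_pretrained(
    "almanach/moderncamembert-base", num_labels=3
)

# Token classification head, e.g. for FTB NER
# (num_labels is a placeholder; set it to the size of your tag set)
ner_model = AutoModelForTokenClassification.from_pretrained(
    "almanach/moderncamembert-base", num_labels=9
)
```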
Pretraining Codebase
We use the pretraining codebase from the ModernBERT repository for all ModernCamemBERT models.
🔧 Technical Details
- Model Type: French language model, the French version of ModernBERT.
- Training Data: A combination of French [RedPajama-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) (filtered), French scientific documents from HALvest, and the French Wikipedia.
- Training Objective: Masked Language Modeling (MLM) with a 30% mask rate.
- Hardware: 48 H100 GPUs.
| Property | Details |
|---|---|
| Model Type | French language model, French version of ModernBERT |
| Training Data | Combination of French RedPajama-V2 (filtered), French scientific documents from HALvest, and French Wikipedia |
| Training Objective | Masked Language Modeling (MLM) with 30% mask rate |
| Hardware | 48 H100 GPUs |
📄 License
The model is released under the MIT license.
📖 Citation
```bibtex
@misc{antoun2025modernbertdebertav3examiningarchitecture,
  title={ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance},
  author={Wissam Antoun and Benoît Sagot and Djamé Seddah},
  year={2025},
  eprint={2504.08716},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.08716},
}
```