# 🚀 ModernCamemBERT
ModernCamemBERT is a French language model pretrained on a large corpus of 1T tokens of high-quality French text. It aims to explore the impact of model design on performance by comparing against other French models.
## 🚀 Quick Start

### Installation

To use ModernCamemBERT, you first need to install the `transformers` library. You can install it using the following command:

```bash
pip install transformers
```

### Usage
```python
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

model = AutoModel.from_pretrained("almanach/moderncamembert-cv2-base")
tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-cv2-base")
```
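For a quick sanity check, the `fill-mask` pipeline can also be used directly with this checkpoint. The example sentence below is purely illustrative; this is only a minimal sketch:

```python
from transformers import pipeline

# Minimal sketch: predict a masked token with the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="almanach/moderncamembert-cv2-base")

# Use the tokenizer's own mask token rather than hard-coding it.
text = f"Le camembert est un fromage {fill_mask.tokenizer.mask_token} en France."
for prediction in fill_mask(text):
    print(prediction["token_str"], round(prediction["score"], 3))
```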
## ✨ Features
- **Large-scale Pretraining**: ModernCamemBERT is pretrained on a large corpus of 1T tokens of high-quality French text, including data from `togethercomputer/RedPajama-Data-V2`, `almanach/HALvest`, and `wikimedia/wikipedia`.
- **Controlled Study**: By pretraining on the same dataset as CamemBERTaV2, it isolates the effect of model design.
- **Context Length Expansion**: The model was first trained with a 1024-token context length, which was later increased to 8192 tokens during pretraining (see the encoding sketch after this list).
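As a rough illustration of the extended context window, a long document can be tokenized with truncation at 8192 tokens; the filler text and its length below are illustrative assumptions:

```python
from transformers import AutoTokenizer

# Illustrative check of the extended 8192-token context length.
tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-cv2-base")

long_text = "Ceci est un très long document français. " * 2000  # filler text for illustration
encoded = tokenizer(long_text, truncation=True, max_length=8192, return_tensors="pt")
print(encoded["input_ids"].shape)  # at most (1, 8192)
```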
## 📦 Installation

```bash
pip install transformers
```
## 💻 Usage Examples

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

model = AutoModel.from_pretrained("almanach/moderncamembert-cv2-base")
tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-cv2-base")
```
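Since the checkpoint is a fill-mask model, `AutoModelForMaskedLM` can also be used for masked-token prediction. The following is a minimal sketch; the example sentence and the top-5 cutoff are illustrative assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-cv2-base")
model = AutoModelForMaskedLM.from_pretrained("almanach/moderncamembert-cv2-base")

# Build an input containing the tokenizer's mask token.
text = f"La capitale de la France est {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and print the top-5 candidate tokens.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```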
## 📚 Documentation

### Fine-tuning Results

Datasets used for fine-tuning include NER (FTB), the FLUE benchmark (XNLI, CLS, PAWS-X), and the French Question Answering Dataset (FQuAD).
| Model | FTB-NER | CLS | PAWS-X | XNLI | F1 (FQuAD) | EM (FQuAD) |
|---|---|---|---|---|---|---|
| CamemBERT | 89.97 | 94.62 | 91.36 | 81.95 | 80.98 | 62.51 |
| CamemBERTa | 90.33 | 94.92 | 91.67 | 82.00 | 81.15 | 62.01 |
| CamemBERTv2 | 81.99 | 95.07 | 92.00 | 81.75 | 80.98 | 61.35 |
| CamemBERTav2 | 93.40 | 95.63 | 93.06 | 84.82 | 83.04 | 64.29 |
| ModernCamemBERT-CV2 | 92.17 | 94.86 | 92.71 | 82.85 | 81.68 | 62.00 |
| ModernCamemBERT | 91.33 | 94.92 | 92.52 | 83.62 | 82.19 | 62.66 |
Fine-tuned models are available in the ModernCamembert Models collection.
### Pretraining Codebase
We use the pretraining codebase from the ModernBERT repository for all ModernCamemBERT models.
## 🔧 Technical Details
ModernCamemBERT was trained using the Masked Language Modeling (MLM) objective with a 30% mask rate on 1T tokens, using 48 H100 GPUs. The training dataset is a combination of French RedPajama-V2 filtered with heuristic and semantic filtering, French scientific documents from HALvest, and the French Wikipedia. Semantic filtering was done by fine-tuning a BERT classifier on a document-quality dataset automatically labeled by Llama-3 70B.
We also reuse the old [CamemBERTaV2](https://huggingface.co/almanach/camembertav2-base) tokenizer. The model was first trained with a 1024-token context length, which was then increased to 8192 tokens later in pretraining. More details about the training process can be found in the ModernCamemBERT paper.
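As a rough sketch of the MLM setup described above (not the actual pretraining pipeline, which lives in the ModernBERT codebase), the standard Hugging Face collator can reproduce 30% dynamic masking with the released tokenizer; the example sentence is illustrative:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Sketch of the 30% MLM masking described above; this is only an illustration
# with the released tokenizer, not the actual pretraining code.
tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-cv2-base")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.3
)

examples = [tokenizer("Le fromage français est réputé dans le monde entier.")]
batch = collator(examples)
print(batch["input_ids"].shape, batch["labels"].shape)  # masked inputs and MLM labels
```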
## 📄 License
This project is licensed under the MIT license.
## 📖 Citation

```bibtex
@misc{antoun2025modernbertdebertav3examiningarchitecture,
      title={ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance},
      author={Wissam Antoun and Benoît Sagot and Djamé Seddah},
      year={2025},
      eprint={2504.08716},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.08716},
}
```
## 💡 Usage Tip
We recommend the ModernCamemBERT model for tasks that require a long context length or efficient inference. For other tasks, CamemBERTaV2 remains the best-performing model on most benchmarks.
| Property | Details |
|---|---|
| Model Type | ModernCamemBERT |
| Training Data | togethercomputer/RedPajama-Data-V2, almanach/HALvest, wikimedia/wikipedia |
| Pipeline Tag | fill-mask |
| Tags | modernbert, camembert |