🚀 ModernCamemBERT
ModernCamemBERT is a French language model pretrained on a large corpus of 1T tokens of high-quality French text. It is the French counterpart of [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) and aims to offer efficient performance for tasks requiring long context lengths or fast inference.
🚀 Quick Start
ModernCamemBERT was trained with the Masked Language Modeling (MLM) objective at a 30% mask rate on 1T tokens, using 48 H100 GPUs. The training dataset combines French [RedPajama-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) (filtered via heuristic and semantic filtering), French scientific documents from HALvest, and the French Wikipedia. Semantic filtering was performed by fine-tuning a BERT classifier on a document-quality dataset automatically labeled by Llama-3 70B.
We reused the existing [CamemBERTav2](https://huggingface.co/almanach/camembertav2-base) tokenizer. The model was initially trained with a 1024-token context length, which was increased to 8192 tokens later in pretraining. More details about the training process can be found in the ModernCamemBERT paper.
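The snippet below is a minimal sketch, not the actual training pipeline (which uses the ModernBERT codebase), of how a 30% MLM masking setup can be reproduced with the Hugging Face `transformers` library, assuming the released checkpoint ships the reused tokenizer:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Load the tokenizer shipped with the checkpoint (reused from CamemBERTaV2)
tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-base")

# ModernCamemBERT uses a 30% mask rate instead of the usual 15%
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.3
)

# Toy batch: roughly 30% of the non-special tokens get masked, and `labels`
# holds the original ids at the masked positions (-100 elsewhere)
batch = collator([tokenizer("Un exemple de texte français pour le pré-entraînement.")])
print(batch["input_ids"])
print(batch["labels"])
```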
The goal of ModernCamemBERT was a controlled study: pretraining ModernBERT on the same dataset as CamemBERTaV2, a French DeBERTaV3 model, to isolate the effect of model design. The results show that the previous-generation DeBERTaV3 model (CamemBERTaV2) remains ahead in sample efficiency and overall benchmark performance, with ModernBERT's main advantage being faster training and inference. ModernCamemBERT nevertheless brings meaningful architectural improvements over the older BERT- and RoBERTa-based CamemBERT/CamemBERTv2 models. Additionally, high-quality pretraining data accelerates convergence but does not significantly improve final performance, suggesting possible benchmark saturation.
✨ Features
- Large-scale Pretraining: Trained on 1T tokens of high-quality French text.
- Masked Language Modeling: Trained with a 30% mask rate using the MLM objective.
- Context Length Increase: Context length increased from 1024 to 8192 tokens during pretraining (see the config check below).
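As a quick sanity check, a sketch assuming the checkpoint exposes the standard ModernBERT configuration fields, the long-context setting can be read directly from the published config:

```python
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("almanach/moderncamembert-base")
tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-base")

# Expected to report the 8192-token context of the final pretraining stage
print(config.max_position_embeddings)
print(tokenizer.model_max_length)
```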
📦 Installation
ModernCamemBERT is distributed through the Hugging Face Hub and requires only the `transformers` library (`pip install -U transformers`); a recent release that includes the ModernBERT architecture is assumed.
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

model = AutoModel.from_pretrained("almanach/moderncamembert-base")
tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-base")
```
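Since the checkpoint was trained with MLM, it can also be used directly for masked-token prediction. The following is an illustrative sketch (the example sentence and the number of candidates shown are arbitrary; predictions depend on the checkpoint):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-base")
model = AutoModelForMaskedLM.from_pretrained("almanach/moderncamembert-base")

# Use the tokenizer's own mask token rather than hard-coding it
text = f"Le camembert est un fromage {tokenizer.mask_token} de Normandie."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 candidates for the masked position
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```

The same checkpoint can also be used through the higher-level `fill-mask` pipeline.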
📚 Documentation
Fine-tuning Results
Datasets used for fine-tuning include NER (FTB), the FLUE benchmark (XNLI, CLS, PAWS-X), and the French Question Answering Dataset (FQuAD).
| Model | FTB-NER | CLS | PAWS-X | XNLI | F1 (FQuAD) | EM (FQuAD) |
|---|---|---|---|---|---|---|
| CamemBERT | 89.97 | 94.62 | 91.36 | 81.95 | 80.98 | 62.51 |
| CamemBERTa | 90.33 | 94.92 | 91.67 | 82.00 | 81.15 | 62.01 |
| CamemBERTv2 | 81.99 | 95.07 | 92.00 | 81.75 | 80.98 | 61.35 |
| CamemBERTav2 | 93.40 | 95.63 | 93.06 | 84.82 | 83.04 | 64.29 |
| ModernCamemBERT-CV2 | 92.17 | 94.86 | 92.71 | 82.85 | 81.68 | 62.00 |
| ModernCamemBERT | 91.33 | 94.92 | 92.52 | 83.62 | 82.19 | 62.66 |
Finetuned models are available in the following collection: [ModernCamembert Models](https://huggingface.co/collections/almanach/moderncamembert-67f7e6d85ede5f7cfc1ce012)
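For reference, the sketch below shows how task-specific heads for this kind of fine-tuning can be instantiated on top of the base checkpoint with `transformers`. The label counts are illustrative placeholders and this is not the exact training recipe behind the reported numbers:

```python
from transformers import AutoModelForSequenceClassification, AutoModelForTokenClassification

# Sentence-pair / text classification head, e.g. for XNLI-style NLI (3 labels)
nli_model = AutoModelForSequenceClassification.from_pretrained(
    "almanach/moderncamembert-base", num_labels=3
)

# Token classification head, e.g. for FTB NER
# (num_labels is a placeholder; set it to the size of your tag set)
ner_model = AutoModelForTokenClassification.from_pretrained(
    "almanach/moderncamembert-base", num_labels=9
)
```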
Pretraining Codebase
We use the pretraining codebase from the ModernBERT repository for all ModernCamemBERT models.
🔧 Technical Details
- Model Type: French language model, the French version of ModernBERT.
- Training Data: A combination of French [RedPajama-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) (filtered), French scientific documents from HALvest, and the French Wikipedia.
- Training Objective: Masked Language Modeling (MLM) with a 30% mask rate.
- Hardware: 48 H100 GPUs.
| Property | Details |
|---|---|
| Model Type | French language model, French version of ModernBERT |
| Training Data | Combination of French RedPajama-V2 (filtered), French scientific documents from HALvest, and French Wikipedia |
| Training Objective | Masked Language Modeling (MLM) with 30% mask rate |
| Hardware | 48 H100 GPUs |
📄 License
The model is released under the MIT license.
📖 Citation
```bibtex
@misc{antoun2025modernbertdebertav3examiningarchitecture,
  title={ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance},
  author={Wissam Antoun and Benoît Sagot and Djamé Seddah},
  year={2025},
  eprint={2504.08716},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.08716},
}
```