# 🚀 ModernCamemBERT
ModernCamemBERT is a French language model pretrained on a large corpus of 1T tokens of high-quality French text. It aims to explore the impact of model design on performance by comparing against other French models.
## 🚀 Quick Start

### Installation

To use ModernCamemBERT, you first need to install the `transformers` library. You can install it using the following command:

```bash
pip install transformers
```

### Usage
```python
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

model = AutoModel.from_pretrained("almanach/moderncamembert-cv2-base")
tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-cv2-base")
```
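For a quick sanity check, the `fill-mask` pipeline can also be used directly with this checkpoint. The example sentence below is purely illustrative; this is only a minimal sketch:

```python
from transformers import pipeline

# Minimal sketch: predict a masked token with the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="almanach/moderncamembert-cv2-base")

# Use the tokenizer's own mask token rather than hard-coding it.
text = f"Le camembert est un fromage {fill_mask.tokenizer.mask_token} en France."
for prediction in fill_mask(text):
    print(prediction["token_str"], round(prediction["score"], 3))
```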
## ✨ Features
- **Large-scale Pretraining**: ModernCamemBERT is pretrained on a large corpus of 1T tokens of high-quality French text, including data from `togethercomputer/RedPajama-Data-V2`, `almanach/HALvest`, and `wikimedia/wikipedia`.
- **Controlled Study**: By pretraining on the same dataset as CamemBERTaV2, it isolates the effect of model design.
- **Context Length Expansion**: The model was first trained with a 1024-token context length, which was later increased to 8192 tokens during pretraining (see the encoding sketch after this list).
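As a rough illustration of the extended context window, a long document can be tokenized with truncation at 8192 tokens; the filler text and its length below are illustrative assumptions:

```python
from transformers import AutoTokenizer

# Illustrative check of the extended 8192-token context length.
tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-cv2-base")

long_text = "Ceci est un très long document français. " * 2000  # filler text for illustration
encoded = tokenizer(long_text, truncation=True, max_length=8192, return_tensors="pt")
print(encoded["input_ids"].shape)  # at most (1, 8192)
```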
## 📦 Installation

```bash
pip install transformers
```
## 💻 Usage Examples

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

model = AutoModel.from_pretrained("almanach/moderncamembert-cv2-base")
tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-cv2-base")
```
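Since the checkpoint is a fill-mask model, `AutoModelForMaskedLM` can also be used for masked-token prediction. The following is a minimal sketch; the example sentence and the top-5 cutoff are illustrative assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-cv2-base")
model = AutoModelForMaskedLM.from_pretrained("almanach/moderncamembert-cv2-base")

# Build an input containing the tokenizer's mask token.
text = f"La capitale de la France est {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and print the top-5 candidate tokens.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```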
## 📚 Documentation

### Fine-tuning Results

Datasets used for fine-tuning include NER (FTB), the FLUE benchmark (XNLI, CLS, PAWS-X), and the French Question Answering Dataset (FQuAD).
| Model | FTB-NER | CLS | PAWS-X | XNLI | F1 (FQuAD) | EM (FQuAD) |
|---|---|---|---|---|---|---|
| CamemBERT | 89.97 | 94.62 | 91.36 | 81.95 | 80.98 | 62.51 |
| CamemBERTa | 90.33 | 94.92 | 91.67 | 82.00 | 81.15 | 62.01 |
| CamemBERTv2 | 81.99 | 95.07 | 92.00 | 81.75 | 80.98 | 61.35 |
| CamemBERTav2 | 93.40 | 95.63 | 93.06 | 84.82 | 83.04 | 64.29 |
| ModernCamemBERT-CV2 | 92.17 | 94.86 | 92.71 | 82.85 | 81.68 | 62.00 |
| ModernCamemBERT | 91.33 | 94.92 | 92.52 | 83.62 | 82.19 | 62.66 |
Fine-tuned models are available in the ModernCamembert Models collection.
### Pretraining Codebase
We use the pretraining codebase from the ModernBERT repository for all ModernCamemBERT models.
## 🔧 Technical Details
ModernCamemBERT was trained using the Masked Language Modeling (MLM) objective with a 30% mask rate on 1T tokens, using 48 H100 GPUs. The training dataset is a combination of French RedPajama-V2 filtered with heuristic and semantic filtering, French scientific documents from HALvest, and the French Wikipedia. Semantic filtering was done by fine-tuning a BERT classifier on a document-quality dataset automatically labeled by Llama-3 70B.
We also reuse the old [CamemBERTaV2](https://huggingface.co/almanach/camembertav2-base) tokenizer. The model was first trained with a 1024-token context length, which was then increased to 8192 tokens later in pretraining. More details about the training process can be found in the ModernCamemBERT paper.
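As a rough sketch of the MLM setup described above (not the actual pretraining pipeline, which lives in the ModernBERT codebase), the standard Hugging Face collator can reproduce 30% dynamic masking with the released tokenizer; the example sentence is illustrative:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Sketch of the 30% MLM masking described above; this is only an illustration
# with the released tokenizer, not the actual pretraining code.
tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-cv2-base")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.3
)

examples = [tokenizer("Le fromage français est réputé dans le monde entier.")]
batch = collator(examples)
print(batch["input_ids"].shape, batch["labels"].shape)  # masked inputs and MLM labels
```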
## 📄 License
This project is licensed under the MIT license.
## 📖 Citation

```bibtex
@misc{antoun2025modernbertdebertav3examiningarchitecture,
      title={ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance},
      author={Wissam Antoun and Benoît Sagot and Djamé Seddah},
      year={2025},
      eprint={2504.08716},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.08716},
}
```
## 💡 Usage Tip
We recommend the ModernCamemBERT model for tasks that require a long context length or efficient inference. For other tasks, CamemBERTaV2 remains the best-performing model on most benchmarks.
| Property | Details |
|---|---|
| Model Type | ModernCamemBERT |
| Training Data | togethercomputer/RedPajama-Data-V2, almanach/HALvest, wikimedia/wikipedia |
| Pipeline Tag | fill-mask |
| Tags | modernbert, camembert |