**CamemBERTv2-base Open-Source French Language Model: Empowering French Text Processing with Massive Corpora**


Developed by almanach

CamemBERTv2 is a French language model pre-trained on a corpus of 275 billion tokens of French text. It is the second generation of CamemBERT and keeps the RoBERTa architecture while using a new tokenizer and a much larger training dataset.

Large Language Model

Transformers

French · Open Source License: MIT · #French Language Model #Masked Language Modeling #Large Corpus Pre-training

Downloads 1,512

Release Time : 11/14/2024

Model Overview

CamemBERTv2 is a French language model suited to a range of natural language processing tasks, including text infilling, part-of-speech tagging, and named entity recognition.

Model Features

Large-scale Pre-training Data

Pre-trained on 275 billion unique tokens, about 8.6 times the roughly 32 billion used for the original version.

New Tokenizer

Uses a WordPiece tokenizer that supports emojis and handles numbers by splitting them into two-digit tokens.

Extended Context Window

The context window is extended to 1024 tokens, so longer passages fit in a single input.
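For inputs longer than the 1024-token window, one common approach is to split the token ids into overlapping windows. The sketch below is a hypothetical helper, not part of the model or of the transformers library. It assumes two special tokens are added per window (`num_special=2`), and the overlap of 128 tokens is an arbitrary illustrative choice.

```python
def sliding_windows(token_ids, max_len=1024, num_special=2, stride=128):
    """Split token ids into overlapping windows that fit the context size.

    max_len     -- context window of the model (1024 for CamemBERTv2)
    num_special -- assumed number of special tokens added to each window
    stride      -- number of tokens shared by two consecutive windows
    """
    body = max_len - num_special  # room left once special tokens are added
    if not 0 <= stride < body:
        raise ValueError("stride must be in [0, max_len - num_special)")

    step = body - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + body])
        if start + body >= len(token_ids):
            break  # the last window reached the end of the input
        start += step
    return windows
```

Each returned window can then be encoded separately, and the per-window predictions merged afterwards.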

High-performance Fine-tuning

Performs well when fine-tuned on French NLP tasks, reaching 97.66 UPOS accuracy for part-of-speech tagging and 91.99 F1 on FTB-NER for named entity recognition.
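As a starting point for fine-tuning on a tagging task, the base checkpoint can be loaded with a token classification head. This is a minimal sketch, not the authors' fine-tuning recipe: the helper names are hypothetical, the label set is assumed to be the 17 Universal Dependencies UPOS tags, and the classification head is randomly initialised until it is trained.

```python
# The 17 Universal Dependencies UPOS tags (assumed label set for this sketch).
UPOS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]


def build_label_maps(tags):
    """Build the id<->label dictionaries stored in the model config."""
    id2label = dict(enumerate(tags))
    label2id = {tag: idx for idx, tag in id2label.items()}
    return id2label, label2id


def load_pos_tagger(model_name="almanach/camembertv2-base"):
    """Load the pretrained encoder with an untrained tagging head on top."""
    # Imported lazily so the label helpers work without transformers installed.
    from transformers import AutoModelForTokenClassification

    id2label, label2id = build_label_maps(UPOS_TAGS)
    return AutoModelForTokenClassification.from_pretrained(
        model_name,
        num_labels=len(UPOS_TAGS),
        id2label=id2label,
        label2id=label2id,
    )
```

The model returned by `load_pos_tagger()` still has to be fine-tuned on annotated data before its predictions are meaningful.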

Model Capabilities

Text Infilling

Part-of-speech Tagging

Dependency Parsing

Named Entity Recognition

Question Answering

Text Classification

Use Cases

Natural Language Processing

French Text Infilling

Used to fill in missing parts of French texts.

Part-of-speech Tagging

Performs part-of-speech tagging on French texts.

UPOS accuracy 97.66

Named Entity Recognition

Identifies named entities in French texts.

FTB-NER F1 score 91.99

Question Answering

French Question Answering

Used to build French question-answering systems.

FQuAD F1 score 80.98
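Question-answering scores of this kind are usually reported as token-level F1 and exact match (EM). The sketch below assumes a SQuAD-style evaluation with a simplified normalisation (lowercasing and ASCII punctuation removal only); the official FQuAD evaluation script may normalise answers differently, so treat this as an illustration of the metrics rather than a way to reproduce the numbers.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalised strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("la tour Eiffel", "tour Eiffel")` gives 0.8, while `exact_match` on the same pair is 0.0. This is why F1 is always at least as high as EM.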

🚀 CamemBERT(a)-v2: A Smarter French Language Model Aged to Perfection

CamemBERTv2 is a French language model pretrained on a large corpus. It is based on the RoBERTa architecture and improves on its predecessor with a much larger training dataset, a new tokenizer, and a longer context window.

🚀 Quick Start

CamemBERTv2 is a French language model pretrained on a corpus of 275B tokens of French text. It is the second version of CamemBERT and is based on the RoBERTa architecture. The model was trained with the Masked Language Modeling (MLM) objective at a 40% mask rate for 3 epochs on 32 H100 GPUs. The training data combines French OSCAR dumps from the CulturaX Project, French scientific documents from HALvest, and the French Wikipedia.
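To make the 40% mask rate concrete, here is a toy sketch of MLM input corruption on a list of tokens. It is not the actual pretraining code: it assumes every selected position is simply replaced by a mask token, uses a hypothetical `"[MASK]"` string, and works on whitespace tokens rather than WordPiece ids.

```python
import random


def mask_tokens(tokens, mask_rate=0.4, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels) for a toy MLM training example.

    labels holds the original token at masked positions and None elsewhere,
    so the loss would only be computed where a token was hidden.
    """
    if not tokens:
        return [], []

    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))

    masked = [mask_token if i in positions else tok
              for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None
              for i, tok in enumerate(tokens)]
    return masked, labels
```

With a 10-token sentence, 4 tokens are hidden and the model is trained to recover them from the remaining 6.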

The model is a drop-in replacement for the original CamemBERT model, with one caveat: the new tokenizer differs from the original CamemBERT tokenizer, so it must be loaded as a fast tokenizer. It works with CamembertTokenizerFast from the transformers library, even though the original CamembertTokenizer was SentencePiece-based.

See also the CamemBERTav2 model, a stronger French language model based on DeBERTaV3.

✨ Features

Model update details

The new update includes:

Much larger pretraining dataset: 275B unique tokens (previously ~32B)
A newly built tokenizer based on WordPiece with 32,768 tokens, which adds the newline and tab characters, supports emojis, and handles numbers better (numbers are split into two-digit tokens)
Extended context window of 1024 tokens
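The tokenizer changes listed above can be checked by tokenizing a few probe strings. The sketch below uses hypothetical helper names and sample sentences; it loads the fast tokenizer through `AutoTokenizer` and returns the token pieces for each sample. No expected output is shown because the exact pieces depend on the published vocabulary.

```python
# Probe strings chosen to exercise the features listed above.
SAMPLES = [
    "Le prix est de 12345 euros.",  # digits, expected to split into short pieces
    "J'adore le fromage 🧀",         # emoji
    "ligne 1\nligne 2\tcolonne",     # newline and tab characters
]


def load_tokenize(model_name="almanach/camembertv2-base"):
    """Return the tokenize function of the model's fast tokenizer."""
    # Imported lazily so inspect_tokenization works without transformers.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    if not tokenizer.is_fast:
        raise RuntimeError("expected a fast tokenizer for this model")
    return tokenizer.tokenize


def inspect_tokenization(samples, tokenize):
    """Map each sample string to the list of tokens produced for it."""
    return {text: tokenize(text) for text in samples}
```

Calling `inspect_tokenization(SAMPLES, load_tokenize())` downloads the tokenizer files on first use. Any other `str -> list` callable can be passed in to compare tokenizers side by side.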

More details are available in the CamemBERTv2 paper (arXiv:2411.08868).

💻 Usage Examples

Basic Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the pretrained masked language model and its matching tokenizer.
camembertv2 = AutoModelForMaskedLM.from_pretrained("almanach/camembertv2-base")
tokenizer = AutoTokenizer.from_pretrained("almanach/camembertv2-base")
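Once the checkpoint is available, text infilling can be tried with the `fill-mask` pipeline from transformers. This is a minimal sketch with hypothetical helper names; the mask token is read from the tokenizer rather than hard-coded, since its exact string is an assumption otherwise.

```python
def make_masked_sentence(template, mask_token):
    """Replace the {mask} placeholder with the tokenizer's own mask token."""
    return template.format(mask=mask_token)


def fill_mask(template, top_k=5, model_name="almanach/camembertv2-base"):
    """Return (token, score) pairs for the single masked position."""
    # Imported lazily so make_masked_sentence works without transformers.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_name)
    sentence = make_masked_sentence(template, fill.tokenizer.mask_token)
    predictions = fill(sentence, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

For example, `fill_mask("Le camembert est {mask} !")` returns the five most likely fillers with their probabilities. The template should contain exactly one `{mask}` placeholder.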

📚 Documentation

Fine-tuning Results

Datasets: POS tagging and Dependency Parsing (GSD, Rhapsodie, Sequoia, FSMB), NER (FTB), the FLUE benchmark (XNLI, CLS, PAWS-X), the French Question Answering Dataset (FQuAD), Social Media NER (Counter-NER), and Medical NER (CAS1, CAS2, E3C, EMEA, MEDLINE).

Model	UPOS	LAS	FTB-NER	CLS	PAWS-X	XNLI	F1 (FQuAD)	EM (FQuAD)	Counter-NER	Medical-NER
CamemBERT	97.59	88.69	89.97	94.62	91.36	81.95	80.98	62.51	84.18	70.96
CamemBERTa	97.57	88.55	90.33	94.92	91.67	82.00	81.15	62.01	87.37	71.86
CamemBERT-bio	-	-	-	-	-	-	-	-	-	73.96
CamemBERTv2	97.66	88.64	91.99	95.07	92.00	81.75	80.98	61.35	87.46	72.77
CamemBERTav2	97.71	88.65	93.40	95.63	93.06	84.82	83.04	64.29	89.53	73.98
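To read the table as differences rather than absolute scores, the rows can be subtracted task by task. The sketch below copies the CamemBERT and CamemBERTv2 rows from the table above; the function and variable names are illustrative only.

```python
TASKS = [
    "UPOS", "LAS", "FTB-NER", "CLS", "PAWS-X", "XNLI",
    "F1 (FQuAD)", "EM (FQuAD)", "Counter-NER", "Medical-NER",
]

# Scores copied from the fine-tuning results table.
SCORES = {
    "CamemBERT":   [97.59, 88.69, 89.97, 94.62, 91.36, 81.95, 80.98, 62.51, 84.18, 70.96],
    "CamemBERTv2": [97.66, 88.64, 91.99, 95.07, 92.00, 81.75, 80.98, 61.35, 87.46, 72.77],
}


def score_deltas(new, old, scores=SCORES, tasks=TASKS):
    """Return {task: new_score - old_score}, rounded to two decimals."""
    return {
        task: round(n - o, 2)
        for task, n, o in zip(tasks, scores[new], scores[old])
    }
```

`score_deltas("CamemBERTv2", "CamemBERT")` shows the largest gains on the NER tasks (+2.02 on FTB-NER, +3.28 on Counter-NER, +1.81 on Medical-NER), an unchanged FQuAD F1, and small drops on LAS, XNLI, and FQuAD EM.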

Fine-tuned models are available in the CamemBERTv2 Finetuned Models collection.

Pretraining Codebase

We use the pretraining codebase from the CamemBERTa repository for all v2 models.

📄 License

The model is released under the MIT license.

📖 Citation

@misc{antoun2024camembert20smarterfrench,
      title={CamemBERT 2.0: A Smarter French Language Model Aged to Perfection},
      author={Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2024},
      eprint={2411.08868},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.08868},
}
