CamemBERTav2-base: Open-Source French Language Model for French NLP Tasks

Camembertav2 Base

Developed by almanach

CamemBERTav2 is a French language model pretrained on 275 billion French text tokens, utilizing the DebertaV2 architecture, and excels in multiple French NLP tasks.

Large Language Model

Transformers

French · Open Source License: MIT · #French NLP #Large-scale Pretraining #DeBERTa Architecture

Downloads 2,972

Release Time: 11/14/2024

Model Overview

The second-generation CamemBERTa model, optimized for French, supports various natural language processing tasks.

Model Features

Large-scale Pretraining

Trained on 275 billion French text tokens, significantly surpassing the original model's 32 billion tokens.

Improved Tokenizer

New WordPiece tokenizer with a 32,768-token vocabulary, improved number handling (numbers are split into two-digit tokens), and support for newline, tab, and emoji characters.

Extended Context Window

Context window extended to 1,024 tokens, enabling processing of longer texts.

Multi-task Performance Enhancement

Outperforms previous models in tasks like POS tagging, named entity recognition, and question answering.
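Even with a 1,024-token window, longer documents have to be split before they reach the model. The following is a minimal sketch of one common approach, overlapping sliding windows over a list of token IDs. The names `chunk_token_ids`, `window`, and `stride`, and the default stride of 512, are illustrative assumptions and do not come from the model card.

```python
from typing import List

MAX_TOKENS = 1024  # context window stated in the feature list


def chunk_token_ids(
    token_ids: List[int], window: int = MAX_TOKENS, stride: int = 512
) -> List[List[int]]:
    """Split token_ids into windows of at most `window` tokens.

    Consecutive windows start `stride` tokens apart, so they overlap
    by `window - stride` tokens (hypothetical helper, not part of any library).
    """
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("need 0 < stride <= window")
    if not token_ids:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(token_ids):
            break
        start += stride
    return chunks
```

Tokenizers typically add special tokens around each input, so in practice a slightly smaller `window` may be needed to leave room for them.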

Model Capabilities

French text understanding

Feature extraction

Masked language modeling

POS tagging

Named entity recognition

Text classification

Question answering system

Use Cases

Natural Language Processing

French Text Analysis

Used for POS tagging and dependency parsing of French texts.

Achieves 97.71% UPOS accuracy on GSD/Rhapsodie/Sequoia/FSMB datasets.

Named Entity Recognition

Identifies named entities in French texts.

Achieves 93.40% F1 score on the FTB-NER dataset.

Question Answering System

Builds French question answering systems.

Achieves 83.04% F1 score and 64.29% EM score on the FQuAD dataset.
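For readers unfamiliar with the two question answering metrics, here is a minimal sketch of how exact match (EM) and token-level F1 are usually computed for a single prediction. The normalization used here (lowercasing and dropping ASCII punctuation) is a simplifying assumption; the official FQuAD evaluation may normalize answers differently, for example by removing articles.

```python
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, replace ASCII punctuation with spaces, split on whitespace."""
    text = text.lower()
    text = "".join(" " if ch in string.punctuation else ch for ch in text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between the two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are typically the average of these per-question values, taking the best score over the available reference answers.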

Academic Research

Scientific Literature Processing

Processes and analyzes French scientific literature.

🚀 CamemBERT(a)-v2: A Smarter French Language Model Aged to Perfection

CamemBERTav2 is a French language model pretrained on a large corpus of 275B tokens of French text. It is the second version of the CamemBERTa model and is based on the DebertaV2 architecture. The model was trained with the Replaced Token Detection (RTD) objective at a 20% mask rate on 32 H100 GPUs. The training dataset combines French OSCAR dumps from the CulturaX project, French scientific documents from HALvest, and the French Wikipedia.

This model is a drop-in replacement for the original CamemBERTa model. Note, however, that the new tokenizer differs from the original CamemBERTa tokenizer, so you need to use Fast Tokenizers with this model. It works with DebertaV2TokenizerFast from the transformers library, even though the original DebertaV2TokenizerFast was sentencepiece-based.
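To make the RTD objective concrete, the sketch below builds one training example: 20% of the positions are selected, their tokens are replaced, and the discriminator's target is a per-token label saying whether the token was replaced. In actual RTD training the replacements come from a small generator model; the uniform draw from a toy vocabulary used here is a stand-in assumption, and `make_rtd_example` is a hypothetical helper, not code from the CamemBERTa repository.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK_RATE = 0.20  # mask rate stated in the text


def make_rtd_example(
    tokens: Sequence[str],
    vocab: Sequence[str],
    mask_rate: float = MASK_RATE,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[int]]:
    """Return (corrupted_tokens, labels) with labels[i] == 1 iff token i changed."""
    if not tokens:
        return [], []
    rng = rng or random.Random()
    n_selected = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_selected)
    corrupted = list(tokens)
    for pos in positions:
        # Stand-in for the generator: a uniform draw from the vocabulary.
        corrupted[pos] = rng.choice(vocab)
    # A sampled token identical to the original is labeled "not replaced".
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels
```

The discriminator is then trained as a binary classifier over every position, which gives a training signal on all tokens rather than only on the masked ones.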

✨ Features

Larger Pretraining Dataset: The new version uses a much larger pretraining dataset with 275B unique tokens, compared to the previous ~32B.
New Tokenizer: A newly built tokenizer based on WordPiece with 32,768 tokens. It adds newline and tab characters, supports emojis, and better handles numbers by splitting them into two-digit tokens.
Extended Context Window: It has an extended context window of 1024 tokens.

For more details, refer to the CamemBERTv2 paper.
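The two-digit number handling can be pictured with a small pre-tokenization sketch. This is only an illustration of the idea: the chunking direction (left to right), the treatment of odd-length numbers, and the helper name `split_digits` are assumptions, and the real tokenizer's rules may differ.

```python
import re
from typing import List

# Digit runs, runs of non-digit word characters, or single punctuation marks.
_PIECE = re.compile(r"\d+|[^\W\d]+|[^\w\s]")


def split_digits(text: str, width: int = 2) -> List[str]:
    """Pre-tokenize text, breaking each digit run into chunks of at most `width` digits."""
    pieces: List[str] = []
    for token in _PIECE.findall(text):
        if token.isdigit():
            # Chunk left to right; a trailing odd digit becomes its own piece.
            pieces.extend(token[i:i + width] for i in range(0, len(token), width))
        else:
            pieces.append(token)
    return pieces


print(split_digits("En 2024, 12345 mots"))
# → ['En', '20', '24', ',', '12', '34', '5', 'mots']
```

Splitting into short fixed-width chunks keeps the number of distinct numeric tokens small, so arbitrary numbers can be represented without a vocabulary entry for each one.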

💻 Usage Examples

Basic Usage

from transformers import AutoModel, AutoTokenizer

# Load the pretrained encoder and its matching fast tokenizer from the Hugging Face Hub.
camembertav2 = AutoModel.from_pretrained("almanach/camembertav2-base")
tokenizer = AutoTokenizer.from_pretrained("almanach/camembertav2-base")
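Once the model and tokenizer are loaded, a typical next step is feature extraction. The sketch below is one possible way to turn sentences into fixed-size vectors by mean-pooling the last hidden states over non-padding tokens. Mean pooling and the helper name `embed_sentences` are assumptions made for illustration; the model card does not prescribe a pooling strategy. Calling the function downloads the model weights and requires `torch` and `transformers` to be installed.

```python
from typing import List

MODEL_ID = "almanach/camembertav2-base"


def embed_sentences(sentences: List[str], model_id: str = MODEL_ID):
    """Return a (batch, hidden_size) tensor of mean-pooled sentence embeddings."""
    # Heavy dependencies are imported lazily, only when the helper is called.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    # Pad to the longest sentence and truncate at the 1,024-token context window.
    batch = tokenizer(
        sentences, padding=True, truncation=True, max_length=1024, return_tensors="pt"
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

    # Average only over real tokens, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts
```

For example, `embed_sentences(["Le fromage est affiné.", "Il pleut à Paris."])` would return one vector per sentence, which can then feed a classifier or a similarity search.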

📚 Documentation

Fine-tuning Results

Datasets used for fine-tuning include POS tagging and Dependency Parsing (GSD, Rhapsodie, Sequoia, FSMB), NER (FTB), the FLUE benchmark (XNLI, CLS, PAWS-X), the French Question Answering Dataset (FQuAD), Social Media NER (Counter-NER), and Medical NER (CAS1, CAS2, E3C, EMEA, MEDLINE).

Model	UPOS	LAS	FTB-NER	CLS	PAWS-X	XNLI	F1 (FQuAD)	EM (FQuAD)	Counter-NER	Medical-NER
CamemBERT	97.59	88.69	89.97	94.62	91.36	81.95	80.98	62.51	84.18	70.96
CamemBERTa	97.57	88.55	90.33	94.92	91.67	82.00	81.15	62.01	87.37	71.86
CamemBERT-bio	-	-	-	-	-	-	-	-	-	73.96
CamemBERTv2	97.66	88.64	91.99	95.07	92.00	81.75	80.98	61.35	87.46	72.77
CamemBERTav2	97.71	88.65	93.40	95.63	93.06	84.82	83.04	64.29	89.53	73.98
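To read the table as gains of CamemBERTav2 over its predecessor CamemBERTa, the per-task differences can be computed directly from the two rows. This is a small arithmetic sketch using only the values reported above; the helper name `score_gains` is hypothetical.

```python
from typing import Dict, List

TASKS = [
    "UPOS", "LAS", "FTB-NER", "CLS", "PAWS-X", "XNLI",
    "F1 (FQuAD)", "EM (FQuAD)", "Counter-NER", "Medical-NER",
]
# Rows copied from the results table.
CAMEMBERTA = [97.57, 88.55, 90.33, 94.92, 91.67, 82.00, 81.15, 62.01, 87.37, 71.86]
CAMEMBERTAV2 = [97.71, 88.65, 93.40, 95.63, 93.06, 84.82, 83.04, 64.29, 89.53, 73.98]


def score_gains(old: List[float], new: List[float]) -> Dict[str, float]:
    """Map each task to the score difference (new - old), rounded to 2 decimals."""
    return {task: round(n - o, 2) for task, o, n in zip(TASKS, old, new)}


GAINS = score_gains(CAMEMBERTA, CAMEMBERTAV2)
for task, gain in GAINS.items():
    print(f"{task:<12} {gain:+.2f}")
```

On these reported numbers, every task improves, with the largest gain on FTB-NER (+3.07 points) and the smallest on LAS (+0.10).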

Finetuned models are available in the following collection: CamemBERTav2 Finetuned Models

Pretraining Codebase

We use the pretraining codebase from the CamemBERTa repository for all v2 models.

📄 License

The model is released under the MIT license.

📖 Citation

@misc{antoun2024camembert20smarterfrench,
      title={CamemBERT 2.0: A Smarter French Language Model Aged to Perfection},
      author={Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2024},
      eprint={2411.08868},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.08868},
}
