🚀 DistilCamemBERT
We present DistilCamemBERT, a distilled version of the well-known CamemBERT, a French RoBERTa model. The goal of distillation is to significantly reduce the model's complexity while maintaining its performance. The proof of concept is detailed in the DistilBERT paper, and the training code is inspired by DistilBERT.
🚀 Quick Start
DistilCamemBERT is a distilled version of the French RoBERTa model CamemBERT. It aims to reduce model complexity while preserving performance.
✨ Features
- Model Distillation: Drastically reduces the complexity of the model while preserving performance.
- Custom Loss Function: The training loss function is a combination of DistilLoss, CosineLoss, and MLMLoss.
- Same Dataset: Trained on the same dataset (OSCAR) as the original CamemBERT to limit bias.
🔧 Technical Details
Loss function
The distilled model (student model) is trained to stay as close as possible to the original model (teacher model). The loss function consists of three parts:
- DistilLoss: A distillation loss that measures the similarity between the output probabilities of the student and teacher models using cross-entropy loss on the MLM task.
- CosineLoss: A cosine embedding loss applied to the last hidden layers of the student and teacher models to ensure collinearity.
- MLMLoss: A Masked Language Modeling (MLM) task loss to train the student model on the original task of the teacher model.
The final loss function is a combination of these three losses with the following weighting:
$$Loss = 0.5 \times DistilLoss + 0.3 \times CosineLoss + 0.2 \times MLMLoss$$
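For illustration, here is a minimal PyTorch sketch of this weighted objective. The function names, the softmax temperature, and the tensor shapes are assumptions made for readability; this is not the actual training code.

```python
import torch
import torch.nn.functional as F

def distil_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy between the teacher's softened output distribution and the student's.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

def cosine_loss(student_hidden, teacher_hidden):
    # Encourage the student's last hidden states to be collinear with the teacher's.
    flat_student = student_hidden.flatten(0, 1)  # (batch * seq_len, dim)
    flat_teacher = teacher_hidden.flatten(0, 1)
    target = torch.ones(flat_student.size(0), device=flat_student.device)
    return F.cosine_embedding_loss(flat_student, flat_teacher, target)

def mlm_loss(student_logits, labels):
    # Standard masked-language-modeling cross-entropy (labels are token ids, -100 is ignored).
    return F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))

def total_loss(student_logits, teacher_logits, student_hidden, teacher_hidden, labels):
    # Weighted combination from the formula above.
    return (
        0.5 * distil_loss(student_logits, teacher_logits)
        + 0.3 * cosine_loss(student_hidden, teacher_hidden)
        + 0.2 * mlm_loss(student_logits, labels)
    )
```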
Dataset
To limit the bias between the student and teacher models, the dataset used for DistilCamemBERT training is the same as that of camembert-base: OSCAR. The French part of this dataset takes up approximately 140 GB of disk space.
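For reference, the French portion of OSCAR can be streamed with the 🤗 Datasets library instead of being downloaded in full. The configuration name below is an assumption, and this snippet is not the preprocessing pipeline used for training.

```python
from datasets import load_dataset

# Stream the French OSCAR subset rather than downloading ~140 GB to disk.
oscar_fr = load_dataset("oscar", "unshuffled_deduplicated_fr", split="train", streaming=True)

# Peek at a few raw documents.
for sample in oscar_fr.take(3):
    print(sample["text"][:80])
```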
Training
The model was pre-trained on an NVIDIA Titan RTX for 18 days.
📚 Documentation
Evaluation results
| Dataset name | f1-score |
| :-: | :-: |
| FLUE CLS | 83% |
| FLUE PAWS-X | 77% |
| FLUE XNLI | 77% |
| [wikiner_fr](https://huggingface.co/datasets/Jean-Baptiste/wikiner_fr) NER | 98% |
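Scores of this kind are typically obtained by fine-tuning the checkpoint on each downstream task. The sketch below is purely illustrative (toy data, assumed hyperparameters), not the exact evaluation setup behind the numbers above.

```python
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

checkpoint = "cmarkea/distilcamembert-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy data standing in for a real downstream split (e.g. FLUE CLS).
raw = Dataset.from_dict({
    "text": ["Ce film est excellent.", "Ce film est ennuyeux."],
    "labels": [1, 0],
})
encoded = raw.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=32))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilcamembert-cls", num_train_epochs=1, report_to="none"),
    train_dataset=encoded,
)
trainer.train()
```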
How to use DistilCamemBERT
Basic Usage
```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the distilled encoder from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("cmarkea/distilcamembert-base")
model = AutoModel.from_pretrained("cmarkea/distilcamembert-base")

# Put the model in evaluation mode for inference.
model.eval()
...
```
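As the snippet above is truncated, one possible continuation (purely illustrative) encodes a sentence and inspects the encoder's last hidden states:

```python
import torch

# Encode a sentence and run it through the distilled encoder.
inputs = tokenizer("J'aime le camembert !", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings: (batch_size, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)
```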
Advanced Usage
```python
from transformers import pipeline

model_fill_mask = pipeline("fill-mask", model="cmarkea/distilcamembert-base", tokenizer="cmarkea/distilcamembert-base")
results = model_fill_mask("Le camembert est <mask> :)")

results
[{'sequence': '<s> Le camembert est délicieux :)</s>', 'score': 0.3878222405910492, 'token': 7200},
 {'sequence': '<s> Le camembert est excellent :)</s>', 'score': 0.06469205021858215, 'token': 2183},
 {'sequence': '<s> Le camembert est parfait :)</s>', 'score': 0.04534877464175224, 'token': 1654},
 {'sequence': '<s> Le camembert est succulent :)</s>', 'score': 0.04128391295671463, 'token': 26202},
 {'sequence': '<s> Le camembert est magnifique :)</s>', 'score': 0.02425697259604931, 'token': 1509}]
```
📄 License
This project is licensed under the MIT license.
📖 Citation
```bibtex
@inproceedings{delestre:hal-03674695,
  TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
  AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
  URL = {https://hal.archives-ouvertes.fr/hal-03674695},
  BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
  ADDRESS = {Vannes, France},
  YEAR = {2022},
  MONTH = Jul,
  KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
  PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
  HAL_ID = {hal-03674695},
  HAL_VERSION = {v1},
}
```
📦 Additional Information
| Property | Details |
| :-: | :-: |
| Model Type | Distilled French RoBERTa model |
| Training Data | OSCAR |