🚀 DistilCamemBERT-Sentiment
DistilCamemBERT-Sentiment is a model fine-tuned from DistilCamemBERT for French sentiment analysis. It uses two datasets to reduce bias and offers faster inference compared to CamemBERT-based models.
✨ Features
- Fine-tuned for French: Specifically tailored for sentiment analysis in the French language.
- Bias Minimization: Built using two diverse datasets, Amazon Reviews and Allociné.fr, to reduce bias.
- Efficient Inference: Halves inference time compared to CamemBERT-based models at the same power consumption.
📦 Installation
The model only requires the Hugging Face `transformers` library; the advanced ONNX example below additionally needs `optimum` with ONNX Runtime support. A minimal setup (assuming a standard Python environment) might look like this:
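```bash
pip install transformers
# Only needed for the ONNX / quantized example:
pip install optimum[onnxruntime]
```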
💻 Usage Examples
Basic Usage
```python
from transformers import pipeline

analyzer = pipeline(
    task='text-classification',
    model="cmarkea/distilcamembert-base-sentiment",
    tokenizer="cmarkea/distilcamembert-base-sentiment"
)

result = analyzer(
    "J'aime me promener en forêt même si ça me donne mal aux pieds.",
    return_all_scores=True
)
result
```
```python
[{'label': '1 star', 'score': 0.047529436647892},
 {'label': '2 stars', 'score': 0.14150355756282806},
 {'label': '3 stars', 'score': 0.3586442470550537},
 {'label': '4 stars', 'score': 0.3181498646736145},
 {'label': '5 stars', 'score': 0.13417290151119232}]
```
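To turn these per-label scores into a single prediction, take the highest-scoring label; a small illustrative follow-up (not part of the original card):

```python
# Pick the most probable label from the scores above.
best = max(result, key=lambda d: d['score'])
print(best['label'], round(best['score'], 4))  # 3 stars 0.3586
```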
Advanced Usage
```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

HUB_MODEL = "cmarkea/distilcamembert-base-sentiment"

tokenizer = AutoTokenizer.from_pretrained(HUB_MODEL)
# Load the ONNX export of the model for faster CPU inference.
model = ORTModelForSequenceClassification.from_pretrained(HUB_MODEL)
onnx_qa = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Quantized ONNX model, for an additional speed-up.
quantized_model = ORTModelForSequenceClassification.from_pretrained(
    HUB_MODEL, file_name="model_quantized.onnx"
)
```
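The quantized model plugs into the same `pipeline` API; a minimal sketch (the example sentence is illustrative):

```python
# Hypothetical usage of the quantized model through the standard pipeline.
quantized_pipe = pipeline(
    "text-classification", model=quantized_model, tokenizer=tokenizer
)
print(quantized_pipe("Ce film était vraiment excellent !"))
```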
📚 Documentation
Dataset
The dataset consists of 204,993 training reviews and 4,999 test reviews from Amazon, plus 235,516 training reviews and 4,729 test reviews from the Allociné.fr website. Each review is labeled into one of five categories:
- 1 star: terrible appreciation.
- 2 stars: bad appreciation.
- 3 stars: neutral appreciation.
- 4 stars: good appreciation.
- 5 stars: excellent appreciation.
Evaluation Results
In addition to accuracy (referred to here as exact accuracy), and to be robust against +/-1 star estimation errors, we use the following performance measure:
$$\mathrm{top\!-\!2\; acc}=\frac{1}{|\mathcal{O}|}\sum_{i\in\mathcal{O}}\sum_{0\leq l < 2}\mathbb{1}(\hat{f}_{i,l}=y_i)$$
where \(\hat{f}_{i,l}\) is the label with the \(l\)-th largest predicted score for observation \(i\), \(y_i\) is its true label, \(\mathcal{O}\) is the test set of observations, and \(\mathbb{1}\) is the indicator function.
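A minimal Python sketch of this metric (function name and array shapes are illustrative assumptions):

```python
import numpy as np

def top_k_accuracy(scores, labels, k=2):
    """scores: (n_samples, n_classes) predicted scores;
    labels: (n_samples,) true class indices."""
    # Indices of the k highest-scoring classes for each sample.
    top_k = np.argsort(scores, axis=1)[:, -k:]
    # A sample is correct if its true label appears among the top k.
    return float(np.mean([y in top_k[i] for i, y in enumerate(labels)]))
```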
| class | exact accuracy (%) | top-2 acc (%) | support |
| :---: | :---: | :---: | :---: |
| global | 61.01 | 88.80 | 9,698 |
| 1 star | 87.21 | 77.17 | 1,905 |
| 2 stars | 79.19 | 84.75 | 1,935 |
| 3 stars | 77.85 | 78.98 | 1,974 |
| 4 stars | 78.61 | 90.22 | 1,952 |
| 5 stars | 85.96 | 82.92 | 1,932 |
Benchmark
This model is compared to three reference models. Since the models do not all use the same target definitions, we detail the performance measure used for each comparison. Mean inference time was measured on an AMD Ryzen 5 4500U @ 2.3 GHz with 6 cores.
bert-base-multilingual-uncased-sentiment
nlptown/bert-base-multilingual-uncased-sentiment is based on the multilingual, uncased version of BERT. Like our model, this sentiment analyzer is trained on Amazon reviews, so the targets and their definitions are the same.
tf-allociné and barthez-sentiment-classification
tblard/tf-allocine, based on the CamemBERT model, and moussaKam/barthez-sentiment-classification, based on BARThez, share the same binary (positive/negative) target definition. To compare our model on this two-class problem, we map the "1 star" and "2 stars" labels to negative sentiment and the "4 stars" and "5 stars" labels to positive sentiment, excluding "3 stars", which can be interpreted as a neutral class. In this setting the problem of +/-1 star estimation errors disappears, so we use the classical accuracy definition; one way to realize the mapping is sketched below.
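One interpretation of this five-to-two class mapping in Python (the function and the score-summing aggregation are assumptions, not the exact benchmark script):

```python
# Illustrative five-class to binary mapping; summing scores per side is an
# assumption, not necessarily the exact benchmark procedure.
def to_binary(scores_per_label):
    negative = sum(d['score'] for d in scores_per_label
                   if d['label'] in ('1 star', '2 stars'))
    positive = sum(d['score'] for d in scores_per_label
                   if d['label'] in ('4 stars', '5 stars'))
    # "3 stars" is excluded as a neutral class.
    return 'positive' if positive > negative else 'negative'

scores = [{'label': '1 star', 'score': 0.05}, {'label': '2 stars', 'score': 0.14},
          {'label': '3 stars', 'score': 0.36}, {'label': '4 stars', 'score': 0.32},
          {'label': '5 stars', 'score': 0.13}]
print(to_binary(scores))  # positive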
🔧 Technical Details
The model is fine-tuned from DistilCamemBERT for French sentiment analysis. By training on two different datasets, Amazon Reviews and Allociné.fr, it aims to minimize bias. Compared to CamemBERT-based models, it halves inference time at the same power consumption, which eases deployment in production.
📄 License
This project is licensed under the MIT license.
📖 Citation
```bibtex
@inproceedings{delestre:hal-03674695,
  TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
  AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
  URL = {https://hal.archives-ouvertes.fr/hal-03674695},
  BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
  ADDRESS = {Vannes, France},
  YEAR = {2022},
  MONTH = Jul,
  KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
  PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
  HAL_ID = {hal-03674695},
  HAL_VERSION = {v1},
}
```