🚀 Sentence-Flaubert-Base Model
This pre-trained sentence embedding model represents the state of the art in sentence embeddings for French. It provides high-performance sentence embeddings for a variety of text-related tasks.
🚀 Quick Start
Pre-trained sentence embedding models are the state of the art for sentence embeddings in French. This model is fine-tuned from the pre-trained flaubert/flaubert_base_uncased using Siamese BERT-Networks with sentence-transformers, combined with [Augmented SBERT](https://aclanthology.org/2021.naacl-main.28.pdf) on the stsb dataset, along with pair sampling strategies based on two models: [CrossEncoder-camembert-large](https://huggingface.co/dangvantuan/CrossEncoder-camembert-large) and [dangvantuan/sentence-camembert-large](https://huggingface.co/dangvantuan/sentence-camembert-large).
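The Augmented SBERT recipe works by letting a cross-encoder score sampled sentence pairs and then fine-tuning the bi-encoder on these "silver" labels. The snippet below is a minimal, illustrative sketch of that idea; the sentence pairs, hyperparameters, and training loop are assumptions for illustration, not the exact setup used to train this model:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.cross_encoder import CrossEncoder

# 1. Score unannotated French sentence pairs with the cross-encoder ("silver" labels).
cross_encoder = CrossEncoder("dangvantuan/CrossEncoder-camembert-large")
unlabeled_pairs = [  # illustrative pairs only
    ("Un avion est en train de décoller.", "Un avion décolle."),
    ("Un homme joue d'une grande flûte.", "Une personne plie un morceau de papier."),
]
silver_scores = cross_encoder.predict(unlabeled_pairs)

# 2. Fine-tune the FlauBERT-based bi-encoder on the silver-labeled pairs.
bi_encoder = SentenceTransformer("flaubert/flaubert_base_uncased")
train_examples = [
    InputExample(texts=list(pair), label=float(score))
    for pair, score in zip(unlabeled_pairs, silver_scores)
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(bi_encoder)
bi_encoder.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```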
✨ Features
- Advanced Fine-Tuning: Uses pre-trained models and advanced techniques such as Augmented SBERT for fine-tuning.
- High Performance: Achieves high Pearson and Spearman correlation coefficients on various benchmarks.
📦 Installation
The model requires the [sentence-transformers](https://www.sbert.net/) library, which can be installed with `pip install -U sentence-transformers`.
💻 Usage Examples
Basic Usage
With sentence-transformers installed, the model can be used directly as follows:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Lajavaness/sentence-flaubert-base")
sentences = [
    "Un avion est en train de décoller.",
    "Un homme joue d'une grande flûte.",
    "Un homme étale du fromage râpé sur une pizza.",
    "Une personne jette un chat au plafond.",
    "Une personne est en train de plier un morceau de papier.",
]
embeddings = model.encode(sentences)
```
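As a quick sanity check, the resulting embeddings can be compared with cosine similarity using the util helpers shipped with sentence-transformers (a small illustrative follow-up, not part of the original card):

```python
from sentence_transformers import util

# Pairwise cosine similarities between the five sentences encoded above.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities.shape)  # torch.Size([5, 5])
print(similarities[0])     # similarity of the first sentence to all others
```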
📚 Documentation
Evaluation
The model can be evaluated as follows on the French dev and test data of stsb:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.readers import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from datasets import load_dataset

# Load the model to evaluate
model = SentenceTransformer("Lajavaness/sentence-flaubert-base")

def convert_dataset(dataset):
    dataset_samples = []
    for df in dataset:
        score = float(df['similarity_score']) / 5.0  # Normalize score to range 0 ... 1
        inp_example = InputExample(texts=[df['sentence1'], df['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples

# Loading the dataset for evaluation
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

# Convert the dataset for evaluation
# For Dev set:
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# For Test set:
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")
```
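For reference, the same Pearson and Spearman numbers can also be computed by hand from the cosine similarities and the gold scores. The sketch below uses scipy and assumes the `model` and `test_samples` objects defined above:

```python
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import util

gold = [example.label for example in test_samples]
emb1 = model.encode([example.texts[0] for example in test_samples])
emb2 = model.encode([example.texts[1] for example in test_samples])
# Cosine similarity of each aligned sentence pair.
cosine_scores = util.cos_sim(emb1, emb2).diagonal().numpy()

print("Pearson:", pearsonr(cosine_scores, gold)[0])
print("Spearman:", spearmanr(cosine_scores, gold)[0])
```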
Test Results
The performance is measured using Pearson and Spearman correlation on the STS benchmark:
On dev
Model | Pearson correlation | Spearman correlation | #params |
---|---|---|---|
[Lajavaness/sentence-flaubert-base](https://huggingface.co/Lajavaness/sentence-flaubert-base) | 87.14 | 87.10 | 137M |
[Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base) | 86.88 | 86.73 | 110M |
[dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 86.73 | 86.54 | 110M |
[inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 85.85 | 85.71 | 137M |
[distiluse-base-multilingual-cased](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 79.22 | 79.16 | 135M |
On test: Pearson and Spearman correlations are evaluated on several different benchmark datasets.
Pearson score
Model | STS-B | [STS12-fr](https://huggingface.co/datasets/Lajavaness/STS12-fr) | [STS13-fr](https://huggingface.co/datasets/Lajavaness/STS13-fr) | [STS14-fr](https://huggingface.co/datasets/Lajavaness/STS14-fr) | [STS15-fr](https://huggingface.co/datasets/Lajavaness/STS15-fr) | [STS16-fr](https://huggingface.co/datasets/Lajavaness/STS16-fr) | [SICK-fr](https://huggingface.co/datasets/Lajavaness/SICK-fr) | #params |
---|---|---|---|---|---|---|---|---|
[Lajavaness/sentence-flaubert-base](https://huggingface.co/Lajavaness/sentence-flaubert-base) | 85.5 | 86.64 | 87.24 | 85.68 | 88.00 | 75.78 | 82.84 | 137M |
[Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base) | 83.46 | 84.49 | 84.61 | 83.94 | 86.94 | 75.20 | 82.86 | 110M |
[inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 82.82 | 84.79 | 85.76 | 82.81 | 85.38 | 74.05 | 82.23 | 137M |
[dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 82.36 | 82.06 | 84.08 | 81.51 | 85.54 | 73.97 | 80.91 | 110M |
[sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 78.63 | 72.51 | 67.25 | 70.12 | 79.93 | 66.67 | 77.76 | 135M |
[hugorosen/flaubert_base_uncased-xnli-sts](https://huggingface.co/hugorosen/flaubert_base_uncased-xnli-sts) | 78.38 | 79.00 | 77.61 | 76.56 | 79.03 | 71.22 | 80.58 | 137M |
[antoinelouis/biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 76.97 | 71.43 | 73.50 | 70.56 | 78.44 | 71.23 | 77.62 | 110M |
Spearman score
Model | STS-B | [STS12-fr](https://huggingface.co/datasets/Lajavaness/STS12-fr) | [STS13-fr](https://huggingface.co/datasets/Lajavaness/STS13-fr) | [STS14-fr](https://huggingface.co/datasets/Lajavaness/STS14-fr) | [STS15-fr](https://huggingface.co/datasets/Lajavaness/STS15-fr) | [STS16-fr](https://huggingface.co/datasets/Lajavaness/STS16-fr) | [SICK-fr](https://huggingface.co/datasets/Lajavaness/SICK-fr) | #params |
---|---|---|---|---|---|---|---|---|
[Lajavaness/sentence-flaubert-base](https://huggingface.co/Lajavaness/sentence-flaubert-base) | 85.67 | 80.00 | 86.91 | 84.59 | 88.10 | 77.84 | 77.55 | 137M |
[inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 83.07 | 77.34 | 85.88 | 80.96 | 85.70 | 76.43 | 77.00 | 137M |
[Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base) | 82.92 | 77.71 | 84.19 | 81.83 | 87.04 | 76.81 | 76.36 | 110M |
[dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 81.64 | 75.45 | 83.86 | 78.63 | 85.66 | 75.36 | 74.18 | 110M |
[sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 77.49 | 69.80 | 68.85 | 68.17 | 80.27 | 70.04 | 72.49 | 135M |
[hugorosen/flaubert_base_uncased-xnli-sts](https://huggingface.co/hugorosen/flaubert_base_uncased-xnli-sts) | 76.93 | 68.96 | 77.62 | 71.87 | 79.33 | 72.86 | 73.91 | 137M |
[antoinelouis/biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 75.55 | 66.89 | 73.90 | 67.14 | 78.78 | 72.64 | 72.03 | 110M |
📄 License
The model is licensed under the Apache-2.0 license.
📚 Citation
```bibtex
@article{reimers2019sentence,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  journal={https://arxiv.org/abs/1908.10084},
  year={2019}
}

@article{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Suárez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, Éric Villemonte and Seddah, Djamé and Sagot, Benoît},
  journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}

@article{thakur2020augmented,
  title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
  author={Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna},
  journal={arXiv e-prints},
  pages={arXiv--2010},
  year={2020}
}
```
Property | Details |
---|---|
Model Type | Sentence Embedding Model |
Training Data | stsb_multi_mt |
License | apache-2.0 |
Library Name | sentence-transformers |





