🚀 Sentence-Flaubert-Base Model
This pre-trained sentence embedding model represents the state of the art in sentence embeddings for French. It provides high-performance sentence embeddings for a variety of text-related tasks.
🚀 Quick Start
Pre-trained sentence embedding models are the state of the art for sentence embeddings in French. This model is fine-tuned from the pre-trained flaubert/flaubert_base_uncased using Siamese BERT-Networks with sentence-transformers, combined with [Augmented SBERT](https://aclanthology.org/2021.naacl-main.28.pdf) on the stsb dataset, along with pair sampling strategies based on two models: [CrossEncoder-camembert-large](https://huggingface.co/dangvantuan/CrossEncoder-camembert-large) and [dangvantuan/sentence-camembert-large](https://huggingface.co/dangvantuan/sentence-camembert-large).
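The Augmented SBERT recipe works by letting a cross-encoder score sampled sentence pairs and then fine-tuning the bi-encoder on these "silver" labels. The snippet below is a minimal, illustrative sketch of that idea; the sentence pairs, hyperparameters, and training loop are assumptions for illustration, not the exact setup used to train this model:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.cross_encoder import CrossEncoder

# 1. Score unannotated French sentence pairs with the cross-encoder ("silver" labels).
cross_encoder = CrossEncoder("dangvantuan/CrossEncoder-camembert-large")
unlabeled_pairs = [  # illustrative pairs only
    ("Un avion est en train de décoller.", "Un avion décolle."),
    ("Un homme joue d'une grande flûte.", "Une personne plie un morceau de papier."),
]
silver_scores = cross_encoder.predict(unlabeled_pairs)

# 2. Fine-tune the FlauBERT-based bi-encoder on the silver-labeled pairs.
bi_encoder = SentenceTransformer("flaubert/flaubert_base_uncased")
train_examples = [
    InputExample(texts=list(pair), label=float(score))
    for pair, score in zip(unlabeled_pairs, silver_scores)
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(bi_encoder)
bi_encoder.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```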
✨ Features
- Advanced Fine-Tuning: Uses pre-trained models and advanced techniques such as Augmented SBERT for fine-tuning.
- High Performance: Achieves high Pearson and Spearman correlation coefficients on various benchmarks.
📦 Installation
The model requires the [sentence-transformers](https://www.sbert.net/) library, which can be installed with `pip install -U sentence-transformers`.
💻 Usage Examples
Basic Usage
With sentence-transformers installed, the model can be used directly as follows:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Lajavaness/sentence-flaubert-base")
sentences = [
    "Un avion est en train de décoller.",
    "Un homme joue d'une grande flûte.",
    "Un homme étale du fromage râpé sur une pizza.",
    "Une personne jette un chat au plafond.",
    "Une personne est en train de plier un morceau de papier.",
]
embeddings = model.encode(sentences)
```
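As a quick sanity check, the resulting embeddings can be compared with cosine similarity using the util helpers shipped with sentence-transformers (a small illustrative follow-up, not part of the original card):

```python
from sentence_transformers import util

# Pairwise cosine similarities between the five sentences encoded above.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities.shape)  # torch.Size([5, 5])
print(similarities[0])     # similarity of the first sentence to all others
```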
📚 Documentation
Evaluation
The model can be evaluated as follows on the French dev and test data of stsb:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.readers import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from datasets import load_dataset

# Load the model to evaluate
model = SentenceTransformer("Lajavaness/sentence-flaubert-base")

def convert_dataset(dataset):
    dataset_samples = []
    for df in dataset:
        score = float(df['similarity_score']) / 5.0  # Normalize score to range 0 ... 1
        inp_example = InputExample(texts=[df['sentence1'], df['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples

# Loading the dataset for evaluation
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

# Convert the dataset for evaluation
# For Dev set:
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# For Test set:
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")
```
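For reference, the same Pearson and Spearman numbers can also be computed by hand from the cosine similarities and the gold scores. The sketch below uses scipy and assumes the `model` and `test_samples` objects defined above:

```python
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import util

gold = [example.label for example in test_samples]
emb1 = model.encode([example.texts[0] for example in test_samples])
emb2 = model.encode([example.texts[1] for example in test_samples])
# Cosine similarity of each aligned sentence pair.
cosine_scores = util.cos_sim(emb1, emb2).diagonal().numpy()

print("Pearson:", pearsonr(cosine_scores, gold)[0])
print("Spearman:", spearmanr(cosine_scores, gold)[0])
```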
Test Results
The performance is measured using Pearson and Spearman correlation on the STS benchmark:
On dev
Model | Pearson correlation | Spearman correlation | #params |
---|---|---|---|
[Lajavaness/sentence-flaubert-base](https://huggingface.co/Lajavaness/sentence-flaubert-base) | 87.14 | 87.10 | 137M |
[Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base) | 86.88 | 86.73 | 110M |
[dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 86.73 | 86.54 | 110M |
[inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 85.85 | 85.71 | 137M |
[distiluse-base-multilingual-cased](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 79.22 | 79.16 | 135M |
On test: Pearson and Spearman correlations are evaluated on several different benchmark datasets.
Pearson score
Model | STS-B | [STS12-fr](https://huggingface.co/datasets/Lajavaness/STS12-fr) | [STS13-fr](https://huggingface.co/datasets/Lajavaness/STS13-fr) | [STS14-fr](https://huggingface.co/datasets/Lajavaness/STS14-fr) | [STS15-fr](https://huggingface.co/datasets/Lajavaness/STS15-fr) | [STS16-fr](https://huggingface.co/datasets/Lajavaness/STS16-fr) | [SICK-fr](https://huggingface.co/datasets/Lajavaness/SICK-fr) | #params |
---|---|---|---|---|---|---|---|---|
[Lajavaness/sentence-flaubert-base](https://huggingface.co/Lajavaness/sentence-flaubert-base) | 85.5 | 86.64 | 87.24 | 85.68 | 88.00 | 75.78 | 82.84 | 137M |
[Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base) | 83.46 | 84.49 | 84.61 | 83.94 | 86.94 | 75.20 | 82.86 | 110M |
[inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 82.82 | 84.79 | 85.76 | 82.81 | 85.38 | 74.05 | 82.23 | 137M |
[dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 82.36 | 82.06 | 84.08 | 81.51 | 85.54 | 73.97 | 80.91 | 110M |
[sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 78.63 | 72.51 | 67.25 | 70.12 | 79.93 | 66.67 | 77.76 | 135M |
[hugorosen/flaubert_base_uncased-xnli-sts](https://huggingface.co/hugorosen/flaubert_base_uncased-xnli-sts) | 78.38 | 79.00 | 77.61 | 76.56 | 79.03 | 71.22 | 80.58 | 137M |
[antoinelouis/biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 76.97 | 71.43 | 73.50 | 70.56 | 78.44 | 71.23 | 77.62 | 110M |
Spearman score
Model | STS-B | [STS12-fr](https://huggingface.co/datasets/Lajavaness/STS12-fr) | [STS13-fr](https://huggingface.co/datasets/Lajavaness/STS13-fr) | [STS14-fr](https://huggingface.co/datasets/Lajavaness/STS14-fr) | [STS15-fr](https://huggingface.co/datasets/Lajavaness/STS15-fr) | [STS16-fr](https://huggingface.co/datasets/Lajavaness/STS16-fr) | [SICK-fr](https://huggingface.co/datasets/Lajavaness/SICK-fr) | #params |
---|---|---|---|---|---|---|---|---|
[Lajavaness/sentence-flaubert-base](https://huggingface.co/Lajavaness/sentence-flaubert-base) | 85.67 | 80.00 | 86.91 | 84.59 | 88.10 | 77.84 | 77.55 | 137M |
[inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 83.07 | 77.34 | 85.88 | 80.96 | 85.70 | 76.43 | 77.00 | 137M |
[Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base) | 82.92 | 77.71 | 84.19 | 81.83 | 87.04 | 76.81 | 76.36 | 110M |
[dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 81.64 | 75.45 | 83.86 | 78.63 | 85.66 | 75.36 | 74.18 | 110M |
[sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 77.49 | 69.80 | 68.85 | 68.17 | 80.27 | 70.04 | 72.49 | 135M |
[hugorosen/flaubert_base_uncased-xnli-sts](https://huggingface.co/hugorosen/flaubert_base_uncased-xnli-sts) | 76.93 | 68.96 | 77.62 | 71.87 | 79.33 | 72.86 | 73.91 | 137M |
[antoinelouis/biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 75.55 | 66.89 | 73.90 | 67.14 | 78.78 | 72.64 | 72.03 | 110M |
📄 License
The model is licensed under the Apache-2.0 license.
📚 Citation
```bibtex
@article{reimers2019sentence,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  journal={https://arxiv.org/abs/1908.10084},
  year={2019}
}

@article{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Suárez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, Éric Villemonte and Seddah, Djamé and Sagot, Benoît},
  journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}

@article{thakur2020augmented,
  title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
  author={Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna},
  journal={arXiv e-prints},
  pages={arXiv--2010},
  year={2020}
}
```
Property | Details |
---|---|
Model Type | Sentence Embedding Model |
Training Data | stsb_multi_mt |
License | apache-2.0 |
Library Name | sentence-transformers |





