Sentence-CamemBERT-Large: An Open-Source French Sentence Embedding Model for Semantic Search

Developed by Lajavaness

French sentence embedding model based on CamemBERT-large, providing powerful semantic search capabilities

Text Embedding | French | Open Source | License: Apache-2.0
#French sentence embeddings #Semantic search #High-precision similarity

Downloads 3,729

Release Time: 10/25/2023

Model Overview

This model aims to represent the content and semantics of French sentences as mathematical vectors, enabling it to understand the meaning of text in queries and documents, not just individual words.

Model Features

Powerful semantic understanding

Capable of understanding the deep semantics of French sentences, not just surface vocabulary

Improved robustness

Outperforms the base version on all STS benchmark datasets

Augmented SBERT training

Uses pair sampling strategies to enhance model performance

Model Capabilities

French sentence embeddings

Semantic similarity calculation

Semantic search

Use Cases

Information retrieval

Semantic search

Document retrieval based on semantics rather than keyword matching

Improves search relevance and accuracy

Text analysis

Sentence similarity calculation

Calculates semantic similarity between two French sentences

Pearson correlation coefficient reaches 88.63 on the French stsb_multi_mt benchmark

🚀 Sentence-CamemBERT-Large

This model is an embedding model for French, capable of representing sentence semantics as vectors and offering powerful semantic search capabilities.

✨ Features

Sentence Embedding: Represents the content and semantics of French sentences as mathematical vectors, capturing the meaning of a text rather than just its individual words.
Enhanced Performance: Improves on dangvantuan/sentence-camembert-base, with greater robustness and better performance on all STS benchmark datasets.
Fine-Tuned: Fine-tuned from the pre-trained facebook/camembert-large using Siamese BERT-Networks with 'sentence-transformers' on the stsb dataset, then combined with Augmented SBERT on the same dataset.
Pair Sampling Strategies: Benefits from pair sampling strategies using two models: CrossEncoder-camembert-large and dangvantuan/sentence-camembert-large.
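
The card does not spell out how the pair sampling was run. As one plausible reading, following the general Augmented SBERT recipe, a bi-encoder first proposes candidate sentence pairs by embedding similarity, and a cross-encoder then assigns each pair a "silver" score that can be added to the training data. The sketch below illustrates only that flow. The names `cross_encoder_score` and `sample_silver_pairs`, the toy 2-dimensional embeddings, and the word-overlap scorer are all hypothetical stand-ins, not the authors' code or the real models.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cross_encoder_score(sentence_a, sentence_b):
    """Hypothetical stand-in for a cross-encoder: word overlap (Jaccard) in 0..1."""
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

def sample_silver_pairs(sentences, embeddings, top_k=1):
    """For each sentence, pair it with its top_k nearest neighbours, then label each pair."""
    pairs = set()
    for i, vec in enumerate(embeddings):
        neighbours = sorted(
            (j for j in range(len(sentences)) if j != i),
            key=lambda j: cosine_similarity(vec, embeddings[j]),
            reverse=True,
        )
        for j in neighbours[:top_k]:
            pairs.add((min(i, j), max(i, j)))  # store each pair once, order-independent
    return [
        (sentences[i], sentences[j], cross_encoder_score(sentences[i], sentences[j]))
        for i, j in sorted(pairs)
    ]

# Toy inputs (assumption): real embeddings would come from the bi-encoder.
sentences = [
    "Un homme joue de la flûte.",
    "Un homme joue de la guitare.",
    "Un avion décolle.",
    "Un avion est en train de décoller.",
]
toy_embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

silver = sample_silver_pairs(sentences, toy_embeddings, top_k=1)
for sentence_a, sentence_b, score in silver:
    print(sentence_a, "|", sentence_b, "|", round(score, 3))
```

In a real pipeline the stand-in scorer would be replaced by the cross-encoder's prediction, and the resulting silver pairs would be merged with the gold stsb pairs before fine-tuning the bi-encoder.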

📦 Model Information

Property	Details
Pipeline Tag	sentence-similarity
Language	fr
Datasets	stsb_multi_mt
Tags	Text, Sentence Similarity, Sentence-Embedding, camembert-large
License	apache-2.0
Library Name	sentence-transformers
Model Name	sentence-camembert-large by Van Tuan DANG
Task	Sentence-Embedding (Text Similarity)
Dataset for Evaluation	Text Similarity fr (stsb_multi_mt, args: fr)
Metric (Test Pearson correlation coefficient)	88.63

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer("Lajavaness/sentence-camembert-large")

sentences = [
    "Un avion est en train de décoller.",
    "Un homme joue d'une grande flûte.",
    "Un homme étale du fromage râpé sur une pizza.",
    "Une personne jette un chat au plafond.",
    "Une personne est en train de plier un morceau de papier.",
]

# Encode the sentences: one embedding vector per sentence
embeddings = model.encode(sentences)
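
Once `embeddings` is available, semantic search reduces to ranking sentences by cosine similarity to a query vector. The following is a minimal sketch that is not part of the original card: the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the short toy vectors stand in for real `model.encode(...)` output so the snippet runs without downloading the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, corpus_vecs, top_k=3):
    """Return (index, score) pairs for the top_k corpus vectors, best match first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional vectors (assumption: placeholders for model.encode output).
corpus_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.3],
    [0.0, 0.2, 0.9],
]
query_vec = [0.8, 0.2, 0.1]

for index, score in rank_by_similarity(query_vec, corpus_vecs, top_k=2):
    print(index, round(score, 3))
```

With the real model, `query_vec` would be `model.encode("...")` and `corpus_vecs` the rows of `embeddings`. The sentence-transformers library also ships `util.cos_sim` and `util.semantic_search`, which are likely the better choice for larger corpora.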

Advanced Usage (Evaluation)

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample
from datasets import load_dataset

# Load the model to evaluate
model = SentenceTransformer("Lajavaness/sentence-camembert-large")

def convert_dataset(dataset):
    """Convert dataset rows into InputExample objects with scores scaled to 0 ... 1."""
    dataset_samples = []
    for row in dataset:
        score = float(row['similarity_score']) / 5.0  # Normalize score to range 0 ... 1
        inp_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples

# Load the French dev and test splits for evaluation
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

# Evaluate on the dev set (results are also written to output_path)
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# Evaluate on the test set
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")
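
The evaluator compares the model's similarity scores with the gold labels using Pearson and Spearman correlation. To make those two metrics concrete, here is a small pure-Python sketch. It is an illustration rather than the library's implementation: the function names are hypothetical, the scores are invented toy values, and the multiplication by 100 assumes the tables in this card report correlations on a 0 to 100 scale.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(xs), ranks(ys))

# Toy data (assumption): gold scores normalized to 0 ... 1 as in convert_dataset,
# predicted scores standing in for cosine similarities between sentence pairs.
gold = [0.0, 0.2, 0.5, 0.8, 1.0]
predicted = [0.10, 0.15, 0.55, 0.70, 0.95]

print("Pearson x100:", round(pearson(gold, predicted) * 100, 2))
print("Spearman x100:", round(spearman(gold, predicted) * 100, 2))
```

Pearson rewards predictions that track the gold scores linearly, while Spearman only checks that the pairs are ordered the same way, which is why the two columns in the results tables can differ.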

📚 Evaluation Results

Dev Set

Model	Pearson correlation	Spearman correlation	#params
Lajavaness/sentence-camembert-large	88.63	88.46	336M
dangvantuan/sentence-camembert-large	88.2	88.02	336M
Sahajtomar/french_semantic	87.44	87.30	336M
Lajavaness/sentence-flaubert-base	87.14	87.10	137M
GPT-3 (text-davinci-003)	85	NaN	175B
GPT-3 (text-embedding-ada-002)	79.75	80.44	NaN

Test Set - Pearson Score

Model	STS-B	STS12-fr	STS13-fr	STS14-fr	STS15-fr	STS16-fr	SICK-fr	params
Lajavaness/sentence-camembert-large	86.26	87.42	89.34	88.05	88.91	77.15	83.13	336M
dangvantuan/sentence-camembert-large	85.88	87.28	89.25	87.91	88.54	76.90	83.26	336M
Sahajtomar/french_semantic	85.80	86.05	88.50	86.57	87.49	77.85	83.27	336M
Lajavaness/sentence-flaubert-base	85.39	86.64	87.24	85.68	87.99	75.78	82.84	137M
GPT-3 (text-embedding-ada-002)	79.03	66.16	75.48	70.69	77.88	65.18	-	-

Test Set - Spearman Score

Model	STS-B	STS12-fr	STS13-fr	STS14-fr	STS15-fr	STS16-fr	SICK-fr	params
Lajavaness/sentence-camembert-large	86.14	81.22	88.61	86.28	89.01	78.65	77.71	336M
dangvantuan/sentence-camembert-large	85.78	81.09	88.68	85.81	88.56	78.49	77.70	336M
Sahajtomar/french_semantic	85.55	77.92	87.85	83.96	87.63	79.07	77.14	336M
Lajavaness/sentence-flaubert-base	85.67	79.97	86.91	84.57	88.10	77.84	77.55	137M
GPT-3 (text-embedding-ada-002)	77.53	64.27	76.41	69.63	78.65	75.30	-	-

📄 License

This project is licensed under the Apache 2.0 License.

📚 Citation

@article{reimers2019sentence,
   title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
   author={Reimers, Nils and Gurevych, Iryna},
   journal={arXiv preprint arXiv:1908.10084},
   url={https://arxiv.org/abs/1908.10084},
   year={2019}
}

@article{martin2020camembert,
   title={CamemBERT: a Tasty French Language Model},
   author={Martin, Louis and Muller, Benjamin and Suárez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, Éric Villemonte and Seddah, Djamé and Sagot, Benoît},
   journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
   year={2020}
}
