CrossEncoder-camembert-large Open-source French Model - Accurately Calculate Sentence Similarity with Excellent Performance

Crossencoder Camembert Large

Developed by Lajavaness

This is a French sentence similarity calculation model based on CamemBERT, improved from dangvantuan/CrossEncoder-camembert-large, with stronger robustness and better performance.

Text Embedding

Transformers

French | Open Source License: Apache-2.0 | #French Semantic Similarity #High-Precision Text Ranking #Cross-Sentence Encoder

Downloads 129

Release Time : 10/25/2023

Model Overview

This model calculates the semantic similarity between two French sentences and outputs a score between 0 and 1. It was trained on the STS benchmark dataset combined with Augmented SBERT.

Model Features

Improved Performance

Compared to the original model, it achieves higher Pearson and Spearman correlation coefficients on most French STS test sets (every listed benchmark except STS13-fr).

Enhanced Robustness

Improved training strategies and model architecture enhance the model's stability and generalization capabilities.

Semantic Understanding

Accurately captures semantic relationships between French sentences and outputs refined similarity scores.

Model Capabilities

French Sentence Similarity Calculation

Semantic Relationship Analysis

Text Pair Scoring

Use Cases

Information Retrieval

Search Result Ranking

Re-rank search results based on the semantic similarity between queries and documents.

Improves the relevance of search results.
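
The re-ranking use case can be sketched as follows. This is a minimal illustration, not part of the original model card: the `rerank` helper and the `toy_overlap_scorer` stand-in are hypothetical names introduced here, and the stand-in only exists so the sketch runs without downloading the model.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer takes (query, document) pairs and returns one score per pair,
# mirroring the call shape of CrossEncoder.predict.
PairScorer = Callable[[List[Tuple[str, str]]], Sequence[float]]


def rerank(query: str, documents: List[str], score_pairs: PairScorer) -> List[Tuple[str, float]]:
    """Return (document, score) tuples sorted by descending score (hypothetical helper)."""
    if not documents:
        return []
    pairs = [(query, doc) for doc in documents]
    scores = score_pairs(pairs)
    ranked = sorted(zip(documents, scores), key=lambda item: item[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked]


def toy_overlap_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in scorer (word overlap), NOT the real model."""
    scores = []
    for left, right in pairs:
        a, b = set(left.lower().split()), set(right.lower().split())
        scores.append(len(a & b) / len(a | b) if (a | b) else 0.0)
    return scores


if __name__ == "__main__":
    docs = [
        "Le chat dort sur le canapé.",
        "Comment réserver un billet de train ?",
        "Horaires des trains pour Lyon.",
    ]
    for doc, score in rerank("réserver un billet de train", docs, toy_overlap_scorer):
        print(f"{score:.2f}  {doc}")
```

To use the actual model, load it with `CrossEncoder('Lajavaness/CrossEncoder-camembert-large', max_length=512)` and pass `model.predict` as `score_pairs`. Because a cross-encoder scores every query-document pair jointly, it is usually applied to a short candidate list produced by a faster first-stage retriever.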

Question Answering Systems

Answer Selection

Select the answer that best matches the semantic meaning of the question from candidate answers.

Improves the accuracy of QA systems.

Text Matching

Duplicate Question Detection

Identify duplicate questions on community Q&A platforms.

Reduces redundant content and improves platform quality.
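
Duplicate detection could be approached by scoring every pair of questions and keeping the pairs above a cut-off. The sketch below makes several assumptions: `find_duplicates` and `toy_exact_scorer` are hypothetical names, the 0.8 threshold is an arbitrary illustrative value that would need tuning on real data, and the stand-in scorer replaces the model so the example runs offline.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

# Same call shape as CrossEncoder.predict: list of text pairs in, one score per pair out.
PairScorer = Callable[[List[Tuple[str, str]]], Sequence[float]]


def find_duplicates(
    questions: List[str],
    score_pairs: PairScorer,
    threshold: float = 0.8,  # assumed cut-off, not a value from the model card
) -> List[Tuple[int, int, float]]:
    """Return (i, j, score) for each question pair scoring at or above the threshold."""
    index_pairs = list(combinations(range(len(questions)), 2))
    if not index_pairs:
        return []
    text_pairs = [(questions[i], questions[j]) for i, j in index_pairs]
    scores = score_pairs(text_pairs)
    return [(i, j, float(s)) for (i, j), s in zip(index_pairs, scores) if s >= threshold]


def toy_exact_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in: 1.0 if both texts match ignoring case and punctuation, else 0.0."""
    def normalize(text: str) -> List[str]:
        kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
        return kept.split()
    return [1.0 if normalize(a) == normalize(b) else 0.0 for a, b in pairs]


if __name__ == "__main__":
    questions = [
        "Comment changer mon mot de passe ?",
        "comment changer mon mot de passe",
        "Où est la gare ?",
    ]
    print(find_duplicates(questions, toy_exact_scorer))
```

Scoring all pairs grows quadratically with the number of questions, so on a large platform it is more practical to score only candidate pairs that a cheaper method has already shortlisted.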

🚀 CrossEncoder-camembert-large

A cross-encoder model for sentence similarity, offering enhanced robustness and performance.

🚀 Quick Start

This model is an improved version of dangvantuan/CrossEncoder-camembert-large, providing greater robustness and better performance.

✨ Features

Enhanced Performance: Offers better robustness and performance compared to its predecessor.
Semantic Similarity Prediction: Predicts a score between 0 and 1 for the semantic similarity of two sentences.

📦 Installation

The model is easiest to use with sentence-transformers installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import CrossEncoder

model = CrossEncoder('Lajavaness/CrossEncoder-camembert-large', max_length=512)
# Each tuple is one sentence pair; predict returns one similarity score per pair
scores = model.predict([
    ('Un avion est en train de décoller.', "Un homme joue d'une grande flûte."),
    ('Un homme étale du fromage râpé sur une pizza.', 'Une personne jette un chat au plafond'),
])
print(scores)

📚 Documentation

Model

This is a cross-encoder model for sentence similarity. It is an improvement over the dangvantuan/CrossEncoder-camembert-large model, offering greater robustness and better performance.

Training Data

This model was trained on the STS benchmark dataset and combined with Augmented SBERT. It benefits from Pair Sampling Strategies using two models: CrossEncoder-camembert-large and dangvantuan/sentence-camembert-large. The model predicts a score between 0 and 1 for the semantic similarity of two sentences.
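
The exact training pipeline is not described here. As a rough illustration of the general Augmented SBERT idea only, the sketch below samples candidate sentence pairs by embedding similarity (the role a bi-encoder could play) and labels them with a pair scorer (the role a cross-encoder could play) to produce weakly labeled "silver" pairs. All function names are hypothetical, and `embed` and `score_pairs` are injected placeholders; in practice they might be `SentenceTransformer.encode` and `CrossEncoder.predict`.

```python
import math
from typing import Callable, List, Sequence, Tuple

Embedder = Callable[[List[str]], Sequence[Sequence[float]]]
PairScorer = Callable[[List[Tuple[str, str]]], Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sample_pairs(sentences: List[str], embed: Embedder, top_k: int = 1) -> List[Tuple[int, int]]:
    """Pair each sentence with its top_k most similar other sentences (assumed sampling strategy)."""
    vectors = embed(sentences)
    pairs = set()
    for i in range(len(sentences)):
        others = [j for j in range(len(sentences)) if j != i]
        others.sort(key=lambda j: cosine(vectors[i], vectors[j]), reverse=True)
        for j in others[:top_k]:
            pairs.add((min(i, j), max(i, j)))  # store each unordered pair once
    return sorted(pairs)


def build_silver_dataset(
    sentences: List[str], embed: Embedder, score_pairs: PairScorer, top_k: int = 1
) -> List[Tuple[str, str, float]]:
    """Label the sampled pairs with the pair scorer to obtain (sentence1, sentence2, score) rows."""
    index_pairs = sample_pairs(sentences, embed, top_k)
    if not index_pairs:
        return []
    text_pairs = [(sentences[i], sentences[j]) for i, j in index_pairs]
    labels = score_pairs(text_pairs)
    return [(a, b, float(label)) for (a, b), label in zip(text_pairs, labels)]
```

The silver pairs would then be combined with the gold STS data for training; how the two named models were actually combined for this model is an open detail that this sketch does not settle.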

Evaluation

The model can be evaluated on the French dev and test splits of stsb_multi_mt as follows:

from sentence_transformers import CrossEncoder, InputExample
from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator
from datasets import load_dataset

# Load the model to evaluate
model = CrossEncoder('Lajavaness/CrossEncoder-camembert-large', max_length=512)

def convert_dataset(dataset):
    """Convert stsb_multi_mt rows into InputExample objects with labels in [0, 1]."""
    dataset_samples = []
    for row in dataset:
        score = float(row['similarity_score']) / 5.0  # Normalize score to range 0 ... 1
        inp_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples

# Loading the dataset for evaluation
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

# Convert the dataset for evaluation

# For Dev set:
dev_samples = convert_dataset(df_dev)
val_evaluator = CECorrelationEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# For Test set, the Pearson and Spearman correlation are evaluated on many different benchmark datasets:

test_samples = convert_dataset(df_test)
test_evaluator = CECorrelationEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")

Test Result: The performance is measured using Pearson and Spearman correlation:

On dev:

Model	Pearson correlation	Spearman correlation	#params
Lajavaness/CrossEncoder-camembert-large	90.34	90.15	336M
dangvantuan/CrossEncoder-camembert-large	90.11	90.01	336M

On test:

Pearson score

Model	STS-B	STS12-fr	STS13-fr	STS14-fr	STS15-fr	STS16-fr	SICK-fr
Lajavaness/CrossEncoder-camembert-large	88.63	90.76	88.24	90.22	92.23	82.31	84.61
dangvantuan/CrossEncoder-camembert-large	88.16	90.12	88.36	89.86	92.04	82.01	84.23

Spearman score

Model	STS-B	STS12-fr	STS13-fr	STS14-fr	STS15-fr	STS16-fr	SICK-fr
Lajavaness/CrossEncoder-camembert-large	88.03	84.87	87.88	89.10	92.16	82.50	80.78
dangvantuan/CrossEncoder-camembert-large	87.57	84.24	88.01	88.62	91.99	82.16	80.38
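
For readers unfamiliar with the two metrics, the following dependency-free sketch shows how Pearson and Spearman correlations between gold labels and predicted scores can be computed. It is an illustration only: the function names are introduced here, and the evaluator used above relies on its own implementation. The table values appear to be the coefficients multiplied by 100.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1  # mean of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    gold = [0.0, 0.2, 0.5, 0.9, 1.0]         # made-up labels for illustration
    predicted = [0.1, 0.3, 0.4, 0.8, 0.95]   # made-up model scores
    print(pearson(gold, predicted), spearman(gold, predicted))
```

Pearson measures how linearly the predicted scores track the gold labels, while Spearman only checks that the ordering agrees, which is why the two columns can differ for the same model.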

📄 License

This model is licensed under the Apache-2.0 license.

Additional Information

Property	Details
Model Type	Cross-Encoder Model for sentence-similarity
Training Data	STS benchmark dataset combined with Augmented SBERT
Pipeline Tag	text-ranking
Language	fr
Datasets	stsb_multi_mt
Tags	Text, Sentence Similarity, Sentence-Embedding, camembert-base
Model Name	CrossEncoder-camembert-large by Van Tuan DANG
Results Task	Text Similarity (Sentence-Embedding)
Results Dataset	Text Similarity fr (stsb_multi_mt, args: fr)
Results Metrics	Pearson_correlation_coefficient (Test Pearson correlation coefficient: 90.34)
