CrossEncoder-camembert-large Open-source Model - Free Calculation of Semantic Similarity Scores for French Sentences

Crossencoder Camembert Large

Developed by dangvantuan

This is a French sentence similarity calculation model based on CamemBERT, used to predict the semantic similarity score between two sentences.

Text Embedding

Transformers

Language: French

Open Source License: Apache-2.0

#French Semantic Similarity #High-Precision Ranking #Sentence Pair Scoring

Downloads 167

Release Time : 3/28/2022

Model Overview

This model is trained using the Cross-Encoder architecture, specifically designed for calculating semantic similarity between French sentence pairs, outputting a similarity score between 0 and 1.

Model Features

Efficient Sentence Similarity Calculation

Specially optimized for French sentence pair similarity calculation tasks

Based on CamemBERT-large

Uses the powerful French pre-trained model CamemBERT-large as the base architecture

High Accuracy

Achieves a Pearson correlation of 88.16 (correlation × 100) on the French STS benchmark test set (stsb_multi_mt, fr)

Model Capabilities

French Sentence Similarity Calculation

Semantic Relevance Scoring

Text Pair Comparison

Use Cases

Text Matching

Q&A Systems

Evaluate the matching degree between user questions and candidate answers

Improves the accuracy of Q&A systems

Information Retrieval

Re-rank search results to improve relevance

Enhances search result quality
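
A minimal sketch of the re-ranking use case, assuming a first-stage retriever has already produced a short list of candidates. The `rerank` helper, the `ScoreFn` alias, `camembert_score_fn`, and the `top_k` default are hypothetical names introduced for illustration; only `CrossEncoder` and its `predict` method come from sentence-transformers.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical alias: any callable that maps (query, candidate) pairs to scores.
ScoreFn = Callable[[List[Tuple[str, str]]], Sequence[float]]


def rerank(query: str, candidates: List[str], score_fn: ScoreFn,
           top_k: int = 3) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and return the top_k, best first."""
    if not candidates:
        return []
    pairs = [(query, candidate) for candidate in candidates]
    scores = [float(s) for s in score_fn(pairs)]
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def camembert_score_fn() -> ScoreFn:
    """Load the cross-encoder lazily (downloads the weights on first use)."""
    from sentence_transformers import CrossEncoder

    model = CrossEncoder("dangvantuan/CrossEncoder-camembert-large", max_length=128)
    return model.predict
```

With the model installed, a call would look like `rerank(query, hits, camembert_score_fn())`. Because every candidate needs its own forward pass with the query, this approach is usually applied to a few dozen retrieved candidates rather than a whole corpus.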

Natural Language Processing

Text Deduplication

Identify semantically similar text content

Effectively reduces duplicate content
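
Deduplication with a pairwise scorer could be sketched as a greedy pass over the texts. The `deduplicate` helper and the `0.9` threshold are assumptions made for illustration, not values from the model card; `score_fn` stands for any callable with the same shape as `CrossEncoder.predict`.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical alias: any callable that maps sentence pairs to similarity scores.
ScoreFn = Callable[[List[Tuple[str, str]]], Sequence[float]]


def deduplicate(texts: List[str], score_fn: ScoreFn,
                threshold: float = 0.9) -> List[str]:
    """Keep a text only if it scores below `threshold` against every text kept so far."""
    kept: List[str] = []
    for text in texts:
        if kept:
            # Compare the new text with all previously kept texts in one batch.
            scores = score_fn([(text, other) for other in kept])
            if max(float(s) for s in scores) >= threshold:
                continue  # Near-duplicate of something already kept.
        kept.append(text)
    return kept
```

The number of scored pairs grows quadratically with the number of distinct texts, so on larger collections it may be worth pre-filtering candidate pairs with a cheaper method first. The threshold should be tuned on labeled examples from the target domain.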

🚀 CrossEncoder-camembert-large Model

A cross-encoder model for sentence similarity, trained to predict semantic similarity scores between sentences.

🚀 Quick Start

This model is a cross-encoder for sentence similarity, trained using the sentence-transformers Cross-Encoder class. It predicts a score between 0 and 1 for the semantic similarity of two sentences.

✨ Features

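Text Ranking: Suited to re-ranking candidate texts against a query.
Sentence Similarity: Predicts the semantic similarity between two sentences as a score between 0 and 1.
Sentence Pair Scoring: Processes both sentences jointly and outputs a single score; unlike a bi-encoder, it does not produce standalone sentence embeddings.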

📦 Installation

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import CrossEncoder

# max_length=128 truncates each tokenized sentence pair to 128 tokens
model = CrossEncoder('dangvantuan/CrossEncoder-camembert-large', max_length=128)

# Each tuple is one sentence pair; predict returns one similarity score per pair
scores = model.predict([
    ('Un avion est en train de décoller.', "Un homme joue d'une grande flûte."),
    ("Un homme étale du fromage râpé sur une pizza.", "Une personne jette un chat au plafond"),
])

📚 Documentation

Model

Cross-Encoder for sentence-similarity. This model was trained using sentence-transformers Cross-Encoder class.

Training Data

This model was trained on the STS benchmark dataset. The model will predict a score between 0 and 1 for the semantic similarity of two sentences.

Evaluation

The model can be evaluated on the French dev and test splits of stsb_multi_mt as follows.

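from sentence_transformers import CrossEncoder
from sentence_transformers.readers import InputExample
from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator
from datasets import load_dataset

# Load the model to evaluate (same checkpoint as in Basic Usage)
model = CrossEncoder('dangvantuan/CrossEncoder-camembert-large', max_length=128)

def convert_dataset(dataset):
    """Convert stsb_multi_mt rows into InputExample objects with labels in [0, 1]."""
    dataset_samples = []
    for row in dataset:
        score = float(row['similarity_score']) / 5.0  # Normalize score to range 0 ... 1
        inp_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples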

# Loading the dataset for evaluation
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

# Convert the dataset for evaluation

# For Dev set:
dev_samples = convert_dataset(df_dev)
val_evaluator = CECorrelationEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# For Test set

test_samples = convert_dataset(df_test)
test_evaluator = CECorrelationEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")

Test Result

The performance is measured using Pearson and Spearman correlation:

On dev

| Model | Pearson correlation | Spearman correlation | #params |
| ------------- | ------------- | ------------- | ------------- |
| dangvantuan/CrossEncoder-camembert-large | 90.11 | 90.01 | 336M |

On test

| Model | Pearson correlation | Spearman correlation |
| ------------- | ------------- | ------------- |
| dangvantuan/CrossEncoder-camembert-large | 88.16 | 87.57 |
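
To make the two metrics concrete, here is a small standard-library sketch of how Pearson and Spearman correlation relate predicted scores to gold labels. The helper names and the five gold/predicted values are made up for illustration; they are not taken from stsb_multi_mt or from the model's output. The reported figures appear to be correlations multiplied by 100, which the last lines mirror.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: gold scores on the 0..5 scale, normalized to 0..1.
gold = [g / 5.0 for g in [5.0, 3.8, 2.5, 1.0, 0.0]]
predicted = [0.95, 0.70, 0.55, 0.30, 0.05]  # Made-up model outputs.

print(f"Pearson x100:  {100 * pearson(gold, predicted):.2f}")
print(f"Spearman x100: {100 * spearman(gold, predicted):.2f}")
```

Both metrics are unaffected by rescaling the labels, so dividing the gold scores by 5 changes neither value; the normalization matters for matching the model's 0 to 1 output range, not for the correlation itself.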

📄 License

This model is licensed under the Apache 2.0 license.

📋 Model Information

Property	Details
Pipeline Tag	Text Ranking
Language	French
Datasets	stsb_multi_mt
Tags	Text, Sentence Similarity, Sentence-Embedding, camembert-base
Model Name	CrossEncoder-camembert-large by Van Tuan DANG
Results	Task: Text Similarity, Dataset: Text Similarity fr (stsb_multi_mt, args: fr), Metric: Pearson_correlation_coefficient (Test Pearson correlation coefficient: 88.16)
