🚀 CrossEncoder-camembert-large
A cross-encoder model for sentence similarity, offering enhanced robustness and performance.
🚀 Quick Start
This model is an improved version of dangvantuan/CrossEncoder-camembert-large, providing greater robustness and better performance.
✨ Features
- Enhanced Performance: Offers better robustness and performance compared to its predecessor.
- Semantic Similarity Prediction: Predicts a score between 0 and 1 for the semantic similarity of two sentences.
📦 Installation
Using this model becomes easy when you have sentence-transformers installed:
pip install -U sentence-transformers
💻 Usage Examples
Basic Usage
from sentence_transformers import CrossEncoder
model = CrossEncoder('Lajavaness/CrossEncoder-camembert-large', max_length=512)
scores = model.predict([
    ('Un avion est en train de décoller.', "Un homme joue d'une grande flûte."),
    ("Un homme étale du fromage râpé sur une pizza.", "Une personne jette un chat au plafond"),
])
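To make the returned scores easier to read, one option is to pair each score with its input sentences and sort from most to least similar. The sketch below is only an illustration: `rank_pairs` is a hypothetical helper (not part of sentence-transformers), and `placeholder_scores` holds made-up values standing in for the output of `model.predict`, not real model results.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def rank_pairs(pairs: Sequence[Pair], scores: Sequence[float]) -> List[Tuple[float, Pair]]:
    """Hypothetical helper: attach each score to its sentence pair, highest score first."""
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    scored = [(float(score), pair) for pair, score in zip(pairs, scores)]
    return sorted(scored, key=lambda item: item[0], reverse=True)


pairs = [
    ("Un avion est en train de décoller.", "Un homme joue d'une grande flûte."),
    ("Un homme étale du fromage râpé sur une pizza.", "Une personne jette un chat au plafond"),
]

# Made-up values for illustration only; in practice use: scores = model.predict(pairs)
placeholder_scores = [0.10, 0.20]

ranked = rank_pairs(pairs, placeholder_scores)
for score, (first, second) in ranked:
    print(f"{score:.2f}  {first} | {second}")
```

Because `model.predict` returns one score per input pair in the same order, the same helper should work directly on its output.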
📚 Documentation
Model
This is a cross-encoder model for sentence similarity. It is an improvement over the dangvantuan/CrossEncoder-camembert-large model, offering greater robustness and better performance.
Training Data
This model was trained on the STS benchmark dataset, combined with the Augmented SBERT approach. It benefits from Pair Sampling Strategies that rely on two models: CrossEncoder-camembert-large and dangvantuan/sentence-camembert-large. The model predicts a score between 0 and 1 for the semantic similarity of two sentences.
Evaluation
The model can be evaluated as follows on the French test data of stsb:
from sentence_transformers.readers import InputExample
from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator
from datasets import load_dataset
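
def convert_dataset(dataset):
    # Convert each dataset row into an InputExample for the evaluator.
    dataset_samples = []
    for df in dataset:
        # Gold STS labels range from 0 to 5; rescale them to [0, 1].
        score = float(df['similarity_score']) / 5.0
        inp_example = InputExample(texts=[df['sentence1'], df['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples
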
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")
dev_samples = convert_dataset(df_dev)
val_evaluator = CECorrelationEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")
test_samples = convert_dataset(df_test)
test_evaluator = CECorrelationEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")
Test Result:
The performance is measured using Pearson and Spearman correlation:
Pearson score
Spearman score
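For readers who want to see what these two metrics measure, here is a minimal pure-Python sketch. It is not the implementation used by `CECorrelationEvaluator`; the function names and the `gold`/`pred` values are illustrative assumptions, not results from this model. Pearson measures the linear relationship between predicted and gold scores, while Spearman applies the same formula to their ranks.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative values only (not outputs of this model).
gold = [0.0, 0.2, 0.5, 0.8, 1.0]
pred = [0.1, 0.3, 0.4, 0.9, 0.95]
print(f"Pearson:  {pearson(gold, pred):.4f}")
print(f"Spearman: {spearman(gold, pred):.4f}")
```

In this toy example the predictions preserve the ordering of the gold labels exactly, so the Spearman correlation is 1 even though the Pearson correlation is slightly lower.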
📄 License
This model is licensed under the Apache-2.0 license.
Additional Information
| Property | Details |
|---|---|
| Model Type | Cross-Encoder Model for sentence-similarity |
| Training Data | STS benchmark dataset combined with Augmented SBERT |
| Pipeline Tag | text-ranking |
| Language | fr |
| Datasets | stsb_multi_mt |
| Tags | Text, Sentence Similarity, Sentence-Embedding, camembert-base |
| Model Name | CrossEncoder-camembert-large by Van Tuan DANG |
| Results Task | Text Similarity (Sentence-Embedding) |
| Results Dataset | Text Similarity fr (stsb_multi_mt, args: fr) |
| Results Metrics | Pearson_correlation_coefficient (Test Pearson correlation coefficient: 90.34) |