🚀 Sentence CamemBERT Base
This project provides a pre-trained sentence embedding model for French, achieving state-of-the-art performance in sentence embeddings.
🚀 Quick Start
Pre-trained sentence embedding models represent the state of the art in sentence embeddings for French. This model is fine-tuned from the pre-trained facebook/camembert-base using Siamese BERT-Networks with 'sentence-transformers' on the stsb dataset.
✨ Features
- Pipeline Tag: sentence-similarity
- Language: French
- Datasets: stsb_multi_mt
- Tags: Text, Sentence Similarity, Sentence-Embedding, camembert-base
- License: apache-2.0
- Library Name: sentence-transformers
| Property | Details |
|---|---|
| Model Type | sentence-camembert-base by Van Tuan DANG |
| Training Data | stsb_multi_mt (French) |
📦 Installation
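No dedicated installation steps are given. The usage examples below import the sentence-transformers library, and the evaluation example also imports datasets, so both packages need to be installed.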
💻 Usage Examples
Basic Usage
The model can be used directly (without a language model) as follows:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dangvantuan/sentence-camembert-base")

sentences = [
    "Un avion est en train de décoller.",
    "Un homme joue d'une grande flûte.",
    "Un homme étale du fromage râpé sur une pizza.",
    "Une personne jette un chat au plafond.",
    "Une personne est en train de plier un morceau de papier.",
]

# One embedding vector per input sentence.
embeddings = model.encode(sentences)
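Embeddings like these are usually compared with cosine similarity. The following is a minimal sketch of that comparison in plain Python, not code from the model card: the helper names `cosine_similarity` and `most_similar` are hypothetical, and the three-dimensional vectors are made-up stand-ins for real embeddings so that the block runs without downloading the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(query_index, vectors):
    """Return (index, score) of the vector closest to vectors[query_index]."""
    scores = [
        (i, cosine_similarity(vectors[query_index], v))
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    return max(scores, key=lambda pair: pair[1])


# Toy vectors standing in for real sentence embeddings (hypothetical values).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]

best_index, best_score = most_similar(0, toy_embeddings)
print(best_index, round(best_score, 3))
```

With the real model, the rows of `embeddings` returned by `model.encode(sentences)` could be passed in place of `toy_embeddings`. The sentence-transformers library also ships its own `util.cos_sim` helper, which is likely the more convenient choice in practice.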
Advanced Usage
The model can be evaluated on the French dev and test splits of stsb as follows:
from sentence_transformers import SentenceTransformer
from sentence_transformers.readers import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from datasets import load_dataset

# Load the model to evaluate (same identifier as in Basic Usage).
model = SentenceTransformer("dangvantuan/sentence-camembert-base")

def convert_dataset(dataset):
    """Convert STS rows to InputExample objects, rescaling scores from 0-5 to 0-1."""
    dataset_samples = []
    for df in dataset:
        score = float(df['similarity_score']) / 5.0
        inp_example = InputExample(texts=[df['sentence1'], df['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples

# Load the French dev and test splits of stsb_multi_mt.
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

# Evaluate on the dev split; results are written under output_path.
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# Evaluate on the test split.
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")
📚 Documentation
Evaluation Results
The performance is measured using Pearson and Spearman correlation:
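To make the two metrics concrete, here is a small sketch in plain Python of how Pearson and Spearman correlation can be computed between gold similarity scores and predicted similarities. It is an illustration under stated assumptions, not the evaluator's actual implementation: the function names are hypothetical, and the `gold` and `predicted` values are invented for the example and are not results for this model.

```python
import math


def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores for illustration only (not results for this model).
gold = [0.0, 0.2, 0.5, 0.8, 1.0]
predicted = [0.10, 0.05, 0.40, 0.90, 0.95]

print("pearson:", round(pearson(gold, predicted), 4))
print("spearman:", round(spearman(gold, predicted), 4))
```

Pearson rewards predictions that are linearly related to the gold scores, while Spearman only checks whether the two lists order the sentence pairs the same way, which makes it less sensitive to how the similarity values are scaled.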
📄 License
This project is licensed under the Apache-2.0 license.
📚 Citation
@article{reimers2019sentence,
title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
author={Reimers, Nils and Gurevych, Iryna},
journal={https://arxiv.org/abs/1908.10084},
year={2019}
}
@article{martin2020camembert,
title={CamemBERT: a Tasty French Language Model},
author={Martin, Louis and Muller, Benjamin and Suárez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, Éric Villemonte and Seddah, Djamé and Sagot, Benoît},
journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year={2020}
}