# 🚀 Sentence-Transformers

This project provides a feature extraction model built on `sentence-transformers`. It encodes sentences and paragraphs into 768-dimensional dense vectors, enabling tasks such as sentence similarity computation.
## 🚀 Quick Start
This model can be used through either the `sentence-transformers` or the `transformers` library. Installation and usage examples for both are given below.
### Installation via sentence-transformers
```bash
pip install -U sentence-transformers
```
### Usage via sentence-transformers
```python
from sentence_transformers import SentenceTransformer

sentences = ["Ceci est un exemple", "deuxième exemple"]

model = SentenceTransformer("h4c5/sts-distilcamembert-base")
embeddings = model.encode(sentences)
print(embeddings)
```
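Since the model targets sentence-similarity tasks, a natural next step is to compare the two embeddings. A minimal sketch using the `util.cos_sim` helper from `sentence-transformers`, assuming the snippet above has just been run:

```python
from sentence_transformers import util

# Cosine similarity between the embeddings of the two example sentences
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity)
```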
### Installation via transformers
```bash
pip install -U transformers
```

The usage example below also requires `torch`, which can be installed with `pip install torch` if it is not already present.
### Usage via transformers
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("h4c5/sts-distilcamembert-base")
model = AutoModel.from_pretrained("h4c5/sts-distilcamembert-base")
model.eval()


def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, masking out padding tokens
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


sentences = ["Ceci est un exemple", "deuxième exemple"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
print(sentence_embeddings)
```
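Because the pooling module of the sentence-transformers model uses mean pooling (see the architecture listing further below), both paths should produce the same vectors. A quick sanity check, assuming both snippets above were run in the same session:

```python
import numpy as np

# embeddings: from the sentence-transformers example
# sentence_embeddings: from the transformers + mean_pooling example
print(np.allclose(embeddings, sentence_embeddings.numpy(), atol=1e-5))  # expected: True
```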
## ✨ Features
- This model is based on the `cmarkea/distilcamembert-base` model and was fine-tuned with the `sentence-transformers` library.
- It encodes sentences or paragraphs (up to 512 tokens) into 768-dimensional vectors; both values can be confirmed programmatically, as shown in the sketch after this list.
- The underlying DistilCamemBERT model is a distilled version of CamemBERT that halves the number of parameters and improves inference time.
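A quick way to check these numbers on the loaded model, using standard `sentence-transformers` accessors:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("h4c5/sts-distilcamembert-base")
print(model.get_sentence_embedding_dimension())  # 768
print(model.max_seq_length)  # 512
```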
## 📦 Installation
You can install the required libraries with the following commands:

For `sentence-transformers`:

```bash
pip install -U sentence-transformers
```

For `transformers`:

```bash
pip install -U transformers
```
## 💻 Usage Examples
The Quick Start section above covers basic encoding. The example below shows how to evaluate the model.
### Model Evaluation
The model was evaluated on the test split of the French STSb dataset (`stsb_multi_mt`, config `fr`):
```python
from datasets import load_dataset
from sentence_transformers import InputExample, evaluation


def dataset_to_input_examples(dataset):
    # Convert raw dataset rows into InputExample pairs,
    # normalizing the similarity scores from [0, 5] to [0, 1]
    return [
        InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            label=example["similarity_score"] / 5.0,
        )
        for example in dataset
    ]


sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
sts_test_examples = dataset_to_input_examples(sts_test_dataset)

sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_test_examples, name="sts-test"
)

# `model` is the SentenceTransformer loaded in the Quick Start section
sts_test_evaluator(model, ".")
```
### Evaluation Results

The evaluation results on the `stsb_multi_mt` dataset (French data, test split) are reported by the `EmbeddingSimilarityEvaluator` above as Pearson and Spearman correlations between the cosine similarity of the embeddings and the gold similarity scores.
## 🔧 Technical Details
### Training Parameters

The model was trained with the following parameters:

- **DataLoader:** `torch.utils.data.dataloader.DataLoader` of length 180 with parameters:

```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
- **Loss:** `sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`
- **Parameters of the `fit()` method:**

```json
{
    "epochs": 10,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 500,
    "weight_decay": 0.01
}
```
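These settings map directly onto the `sentence-transformers` training API. Below is a minimal sketch of how such a run could look; it assumes the French STSb train and dev splits were used (the DataLoader length of 180 at batch size 32 is consistent with the ~5,700 train pairs, but the exact training data is an assumption, not stated above):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import (
    InputExample,
    SentenceTransformer,
    evaluation,
    losses,
    models,
)

# Rebuild the architecture described below: DistilCamemBERT + mean pooling
word_embedding_model = models.Transformer("cmarkea/distilcamembert-base")
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(), pooling_mode="mean"
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])


def dataset_to_input_examples(dataset):
    # Similarity scores normalized from [0, 5] to [0, 1]
    return [
        InputExample(
            texts=[ex["sentence1"], ex["sentence2"]],
            label=ex["similarity_score"] / 5.0,
        )
        for ex in dataset
    ]


# Assumed data: French STSb train/dev splits
train_examples = dataset_to_input_examples(
    load_dataset("stsb_multi_mt", name="fr", split="train")
)
dev_examples = dataset_to_input_examples(
    load_dataset("stsb_multi_mt", name="fr", split="dev")
)

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)
dev_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    dev_examples, name="sts-dev"
)

# Parameters taken from the fit() configuration above
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=dev_evaluator,
    epochs=10,
    evaluation_steps=1000,
    warmup_steps=500,
    weight_decay=0.01,
    max_grad_norm=1,
    optimizer_params={"lr": 2e-05},
    scheduler="WarmupLinear",
)
```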
### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: CamembertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
## 📄 License

This project is licensed under the MIT License.
## 📚 Documentation

### Citing
If you use this model, please cite the following papers:
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

@inproceedings{sanh2019distilbert,
    title = {DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
    author = {Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
    booktitle = {NeurIPS EMC^2 Workshop},
    year = {2019},
    url = {https://arxiv.org/abs/1910.01108},
}

@inproceedings{martin2020camembert,
    title = {CamemBERT: a Tasty French Language Model},
    author = {Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
    booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
    year = {2020},
    url = {https://arxiv.org/abs/1911.03894},
}

@inproceedings{delestre:hal-03674695,
    title = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
    author = {Delestre, Cyrile and Amar, Abibatou},
    booktitle = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
    address = {Vannes, France},
    year = {2022},
    month = Jul,
    keywords = {NLP ; Transformers ; CamemBERT ; Distillation},
    url = {https://hal.archives-ouvertes.fr/hal-03674695},
    pdf = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
    note = {arXiv:2205.11111},
    hal_id = {hal-03674695},
    hal_version = {v1},
}
```