# 🚀 Sentence-Transformers

This project provides a feature extraction model built on `sentence-transformers`. It encodes sentences and paragraphs into 768-dimensional dense vectors, enabling tasks such as sentence similarity computation.
## 🚀 Quick Start
This model can be used through either the `sentence-transformers` or the `transformers` library. Installation and usage examples for both are given below.
### Installation via sentence-transformers
```bash
pip install -U sentence-transformers
```
### Usage via sentence-transformers
```python
from sentence_transformers import SentenceTransformer

sentences = ["Ceci est un exemple", "deuxième exemple"]

model = SentenceTransformer("h4c5/sts-distilcamembert-base")
embeddings = model.encode(sentences)
print(embeddings)
```
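Since the model targets sentence-similarity tasks, a natural next step is to compare the two embeddings. A minimal sketch using the `util.cos_sim` helper from `sentence-transformers`, assuming the snippet above has just been run:

```python
from sentence_transformers import util

# Cosine similarity between the embeddings of the two example sentences
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity)
```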
### Installation via transformers
```bash
pip install -U transformers
```

The usage example below also requires `torch`, which can be installed with `pip install torch` if it is not already present.
### Usage via transformers
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("h4c5/sts-distilcamembert-base")
model = AutoModel.from_pretrained("h4c5/sts-distilcamembert-base")
model.eval()


def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, masking out padding tokens
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


sentences = ["Ceci est un exemple", "deuxième exemple"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
print(sentence_embeddings)
```
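Because the pooling module of the sentence-transformers model uses mean pooling (see the architecture listing further below), both paths should produce the same vectors. A quick sanity check, assuming both snippets above were run in the same session:

```python
import numpy as np

# embeddings: from the sentence-transformers example
# sentence_embeddings: from the transformers + mean_pooling example
print(np.allclose(embeddings, sentence_embeddings.numpy(), atol=1e-5))  # expected: True
```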
## ✨ Features
- This model is based on the `cmarkea/distilcamembert-base` model and was fine-tuned with the `sentence-transformers` library.
- It encodes sentences or paragraphs (up to 512 tokens) into 768-dimensional vectors; both values can be confirmed programmatically, as shown in the sketch after this list.
- The underlying DistilCamemBERT model is a distilled version of CamemBERT that halves the number of parameters and improves inference time.
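A quick way to check these numbers on the loaded model, using standard `sentence-transformers` accessors:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("h4c5/sts-distilcamembert-base")
print(model.get_sentence_embedding_dimension())  # 768
print(model.max_seq_length)  # 512
```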
## 📦 Installation
You can install the required libraries with the following commands:

For `sentence-transformers`:

```bash
pip install -U sentence-transformers
```

For `transformers`:

```bash
pip install -U transformers
```
## 💻 Usage Examples
The Quick Start section above covers basic encoding. The example below shows how to evaluate the model.
### Model Evaluation
The model was evaluated on the test split of the French STSb dataset (`stsb_multi_mt`, config `fr`):
```python
from datasets import load_dataset
from sentence_transformers import InputExample, evaluation


def dataset_to_input_examples(dataset):
    # Convert raw dataset rows into InputExample pairs,
    # normalizing the similarity scores from [0, 5] to [0, 1]
    return [
        InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            label=example["similarity_score"] / 5.0,
        )
        for example in dataset
    ]


sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
sts_test_examples = dataset_to_input_examples(sts_test_dataset)

sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_test_examples, name="sts-test"
)

# `model` is the SentenceTransformer loaded in the Quick Start section
sts_test_evaluator(model, ".")
```
### Evaluation Results

The evaluation results on the `stsb_multi_mt` dataset (French data, test split) are reported by the `EmbeddingSimilarityEvaluator` above as Pearson and Spearman correlations between the cosine similarity of the embeddings and the gold similarity scores.
## 🔧 Technical Details
### Training Parameters

The model was trained with the following parameters:

- **DataLoader:** `torch.utils.data.dataloader.DataLoader` of length 180 with parameters:

```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
- **Loss:** `sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`
- **Parameters of the `fit()` method:**

```json
{
    "epochs": 10,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 500,
    "weight_decay": 0.01
}
```
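These settings map directly onto the `sentence-transformers` training API. Below is a minimal sketch of how such a run could look; it assumes the French STSb train and dev splits were used (the DataLoader length of 180 at batch size 32 is consistent with the ~5,700 train pairs, but the exact training data is an assumption, not stated above):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import (
    InputExample,
    SentenceTransformer,
    evaluation,
    losses,
    models,
)

# Rebuild the architecture described below: DistilCamemBERT + mean pooling
word_embedding_model = models.Transformer("cmarkea/distilcamembert-base")
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(), pooling_mode="mean"
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])


def dataset_to_input_examples(dataset):
    # Similarity scores normalized from [0, 5] to [0, 1]
    return [
        InputExample(
            texts=[ex["sentence1"], ex["sentence2"]],
            label=ex["similarity_score"] / 5.0,
        )
        for ex in dataset
    ]


# Assumed data: French STSb train/dev splits
train_examples = dataset_to_input_examples(
    load_dataset("stsb_multi_mt", name="fr", split="train")
)
dev_examples = dataset_to_input_examples(
    load_dataset("stsb_multi_mt", name="fr", split="dev")
)

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)
dev_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    dev_examples, name="sts-dev"
)

# Parameters taken from the fit() configuration above
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=dev_evaluator,
    epochs=10,
    evaluation_steps=1000,
    warmup_steps=500,
    weight_decay=0.01,
    max_grad_norm=1,
    optimizer_params={"lr": 2e-05},
    scheduler="WarmupLinear",
)
```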
### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: CamembertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
## 📄 License

This project is licensed under the MIT License.
## 📚 Documentation

### Citing
If you use this model, please cite the following papers:
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

@inproceedings{sanh2019distilbert,
    title = {DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
    author = {Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
    booktitle = {NeurIPS EMC^2 Workshop},
    year = {2019},
    url = {https://arxiv.org/abs/1910.01108},
}

@inproceedings{martin2020camembert,
    title = {CamemBERT: a Tasty French Language Model},
    author = {Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
    booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
    year = {2020},
    url = {https://arxiv.org/abs/1911.03894},
}

@inproceedings{delestre:hal-03674695,
    title = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
    author = {Delestre, Cyrile and Amar, Abibatou},
    booktitle = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
    address = {Vannes, France},
    year = {2022},
    month = Jul,
    keywords = {NLP ; Transformers ; CamemBERT ; Distillation},
    url = {https://hal.archives-ouvertes.fr/hal-03674695},
    pdf = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
    note = {arXiv:2205.11111},
    hal_id = {hal-03674695},
    hal_version = {v1},
}
```