biencoder-distilcamembert-mmarcoFR Open Source Model - Empowering French Semantic Search, Precise and Efficient

Developed by antoinelouis

This is a dense single-vector dual encoder model for French, suitable for semantic search. The model maps queries and passages to 768-dimensional dense vectors and calculates relevance through cosine similarity.

Text Embedding

Safetensors

French

Open Source License: MIT

#French semantic search #Dense passage retrieval #High recall rate

Downloads 160

Release Time : 5/22/2023

Model Overview

This model is a dual encoder based on DistilCamemBERT, specifically optimized for French information retrieval tasks, capable of efficiently computing semantic similarity between queries and passages.

Model Features

French optimization

Semantic retrieval model specifically optimized for French text

Efficient retrieval

Uses 768-dimensional dense vector representation to support fast cosine similarity calculation

Hard negative mining

Trained with hard negatives mined by 12 different dense retrievers

Model Capabilities

Semantic similarity calculation

Passage retrieval

Information retrieval

Use Cases

Information retrieval

Document retrieval system

Build a French document retrieval system that returns the most relevant documents based on user queries

Achieved Recall@500 of 87.9 on the mMARCO-fr validation set

Question answering system

Serves as the retrieval component in a QA system to find relevant passages from a knowledge base

🚀 biencoder-distilcamembert-mmarcoFR

This is a dense single-vector bi-encoder model for French, designed for semantic search. It maps queries and passages to 768-dimensional dense vectors, and computes relevance via cosine similarity.

🚀 Quick Start

Here are examples of using the model with Sentence-Transformers, FlagEmbedding, or Hugging Face Transformers.

💻 Usage Examples

Basic Usage

Using Sentence-Transformers

Start by installing the library: pip install -U sentence-transformers. Then, you can use the model like this:

from sentence_transformers import SentenceTransformer

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

model = SentenceTransformer('antoinelouis/biencoder-distilcamembert-mmarcoFR')
q_embeddings = model.encode(queries, normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)

similarity = q_embeddings @ p_embeddings.T
print(similarity)
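
The similarity matrix has one row per query and one column per passage. A minimal sketch of turning such a matrix into a ranked list of passages per query follows. It assumes unit-normalized embeddings; the random vectors are placeholders standing in for real model outputs, and `top_k_passages` is a hypothetical helper, not part of Sentence-Transformers.

```python
import numpy as np

def top_k_passages(q_embeddings, p_embeddings, k=3):
    """Return (indices, scores) of the k most similar passages for each query.

    Assumes both inputs are L2-normalized, so the dot product is the cosine similarity.
    """
    similarity = q_embeddings @ p_embeddings.T             # shape: (n_queries, n_passages)
    k = min(k, similarity.shape[1])
    top_idx = np.argsort(-similarity, axis=1)[:, :k]       # best passages first
    top_scores = np.take_along_axis(similarity, top_idx, axis=1)
    return top_idx, top_scores

# Placeholder embeddings (random, 768-dimensional) standing in for model.encode(...) outputs.
rng = np.random.default_rng(0)
q_emb = rng.normal(size=(2, 768))
p_emb = rng.normal(size=(5, 768))
q_emb /= np.linalg.norm(q_emb, axis=1, keepdims=True)
p_emb /= np.linalg.norm(p_emb, axis=1, keepdims=True)

indices, scores = top_k_passages(q_emb, p_emb, k=3)
print(indices)  # passage indices per query, most similar first
print(scores)   # matching cosine similarities
```

For large corpora, this ranking is typically delegated to a vector index rather than computed as a full matrix product.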

Using FlagEmbedding

Start by installing the library: pip install -U FlagEmbedding. Then, you can use the model like this:

from FlagEmbedding import FlagModel

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

model = FlagModel('antoinelouis/biencoder-distilcamembert-mmarcoFR')
q_embeddings = model.encode(queries, normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)

similarity = q_embeddings @ p_embeddings.T
print(similarity)

Using Transformers

Start by installing the libraries: pip install -U transformers torch. Then, you can use the model like this:

import torch
from transformers import AutoTokenizer, AutoModel
from torch.nn.functional import normalize

def mean_pooling(model_output, attention_mask):
    """Mean-pool the contextualized token embeddings, ignoring padding tokens in the mean computation."""
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

tokenizer = AutoTokenizer.from_pretrained('antoinelouis/biencoder-distilcamembert-mmarcoFR')
model = AutoModel.from_pretrained('antoinelouis/biencoder-distilcamembert-mmarcoFR')

q_input = tokenizer(queries, padding=True, truncation=True, return_tensors='pt')
p_input = tokenizer(passages, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    q_output = model(**q_input)
    p_output = model(**p_input)
# Pool token embeddings, then L2-normalize so the dot product equals the cosine similarity
q_embeddings = mean_pooling(q_output, q_input['attention_mask'])
q_embeddings = normalize(q_embeddings, p=2, dim=1)
p_embeddings = mean_pooling(p_output, p_input['attention_mask'])
p_embeddings = normalize(p_embeddings, p=2, dim=1)

similarity = q_embeddings @ p_embeddings.T
print(similarity)
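
To make the role of the attention mask in mean pooling concrete, here is a toy illustration in NumPy with hand-picked numbers (not model outputs). The helper name `masked_mean` is illustrative; the point is that padded positions contribute nothing to the average.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """Average token vectors over real tokens only (mask == 1)."""
    mask = attention_mask[..., None].astype(float)          # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)          # padded tokens are zeroed out
    counts = np.clip(mask.sum(axis=1), 1e-9, None)          # avoid division by zero
    return summed / counts

# One sequence of 3 positions with 2-dimensional toy embeddings; the last position is padding.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean(tokens, mask)
print(pooled)  # [[2. 3.]]
```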

📚 Documentation

Evaluation

The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k). To see how it compares to other neural retrievers in French, check out the DécouvrIR leaderboard.
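
For readers unfamiliar with these metrics, here is a minimal sketch of the reciprocal rank and recall at a cut-off k for a single query. The passage ids are made up and the function names are illustrative; this is not the evaluation script behind the reported numbers.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1 / rank of the first relevant passage within the top k, or 0.0 if none appears."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant passages that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Toy example: hypothetical passage ids, with passages 7 and 42 being relevant.
ranked = [3, 42, 11, 7, 5]
relevant = {7, 42}

print(reciprocal_rank_at_k(ranked, relevant, k=10))  # 0.5 (first hit at rank 2)
print(recall_at_k(ranked, relevant, k=3))            # 0.5 (only 42 is in the top 3)
print(recall_at_k(ranked, relevant, k=5))            # 1.0
```

Averaging these per-query values over all queries gives MRR@k and R@k.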

Training

Data

We use the French training samples from the mMARCO dataset, a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries. We do not employ the BM25 negatives provided by the official dataset but instead sample harder negatives mined from 12 distinct dense retrievers, using the msmarco-hard-negatives distillation dataset.
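
As an illustration of drawing a hard negative from candidates mined by several retrievers, consider the sketch below. The record layout (`pos`, and `neg` keyed by retriever name), the retriever names, and the uniform sampling over the pooled candidates are all assumptions made for illustration; they are not the verified schema of the dataset nor the author's sampling procedure.

```python
import random

def sample_hard_negative(record, rng):
    """Pick one negative passage id at random from the pooled candidates of all retrievers."""
    positives = set(record["pos"])
    pooled = {
        pid
        for negatives in record["neg"].values()   # one list of ids per retriever
        for pid in negatives
        if pid not in positives                   # never sample a known positive
    }
    return rng.choice(sorted(pooled))             # sorted so the draw is reproducible

# Hypothetical record: passage ids and retriever names are made up.
record = {
    "qid": 1,
    "pos": [10],
    "neg": {
        "retriever_a": [20, 30, 10],
        "retriever_b": [30, 40],
    },
}

rng = random.Random(0)
negative_id = sample_hard_negative(record, rng)
print(negative_id)
```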

Implementation

The model is initialized from the cmarkea/distilcamembert-base checkpoint and optimized via the cross-entropy loss (as in DPR) with a temperature of 0.05. It is fine-tuned on one 32GB NVIDIA V100 GPU for 20 epochs (i.e., 65.7k steps) using the AdamW optimizer with a batch size of 152 and a peak learning rate of 2e-5, with warmup over the first 500 steps and linear scheduling. We set the maximum sequence length for both questions and passages to 128 tokens. We use cosine similarity to compute relevance scores.
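
The objective can be illustrated with a short NumPy sketch: cross-entropy over temperature-scaled cosine similarities, where the positive passage for query i sits at row i and the other passages in the batch act as negatives. The additional hard negatives are omitted for brevity, the random vectors are placeholders, and `contrastive_cross_entropy` is an illustrative name rather than the author's training code.

```python
import numpy as np

def contrastive_cross_entropy(q_emb, p_emb, temperature=0.05):
    """Cross-entropy over cosine similarities, with the positive for query i at column i."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    p = p_emb / np.linalg.norm(p_emb, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature                         # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                      # positives lie on the diagonal

rng = np.random.default_rng(0)
passages = rng.normal(size=(4, 768))
aligned_queries = passages + 0.01 * rng.normal(size=(4, 768))  # near-copies of their positives
random_queries = rng.normal(size=(4, 768))                     # unrelated to any passage

loss_aligned = contrastive_cross_entropy(aligned_queries, passages)
loss_random = contrastive_cross_entropy(random_queries, passages)
print(loss_aligned, loss_random)  # the aligned batch yields the much smaller loss
```

With a temperature of 0.05, cosine similarities in [-1, 1] are stretched to logits in [-20, 20], which sharpens the softmax over the candidates.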

📄 License

This project is licensed under the MIT license.

📚 Citation

@online{louis2024decouvrir,
	author    = {Antoine Louis},
	title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
	publisher = {Hugging Face},
	month     = {mar},
	year      = {2024},
	url       = {https://huggingface.co/spaces/antoinelouis/decouvrir},
}

📋 Model Information

Property	Details
Pipeline Tag	Sentence Similarity
Language	French
License	MIT
Datasets	unicamp-dl/mmarco
Metrics	Recall
Tags	Passage Retrieval
Library Name	Sentence Transformers
Base Model	cmarkea/distilcamembert-base
Model Name	biencoder-distilcamembert-mmarcoFR
Results	Task: Sentence Similarity (Passage Retrieval); Dataset: mMARCO-fr (validation split); Metrics: Recall@500: 87.9, Recall@100: 76.4, Recall@10: 49.2, MAP@10: 26.2, nDCG@10: 31.9, MRR@10: 26.8
