🚀 biencoder-electra-base-french-mmarcoFR
This is a dense single-vector bi-encoder model for French, designed for semantic search. It maps queries and passages to 768-dimensional dense vectors and computes relevance via cosine similarity.
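As a quick illustration of this scoring scheme (a standalone sketch, not part of the model's code), the snippet below checks that cosine similarity over L2-normalized vectors reduces to a plain dot product, which is why the usage examples below normalize the embeddings and then compute `q_embeddings @ p_embeddings.T`:

```python
import numpy as np

def cosine_similarity(q: np.ndarray, p: np.ndarray) -> float:
    """Cosine similarity between a query vector and a passage vector."""
    return float(np.dot(q, p) / (np.linalg.norm(q) * np.linalg.norm(p)))

# Toy 768-dimensional vectors, L2-normalized as the model's embeddings would be.
q = np.random.rand(768); q /= np.linalg.norm(q)
p = np.random.rand(768); p /= np.linalg.norm(p)

# With unit-norm vectors, the dot product equals the cosine similarity.
assert abs(cosine_similarity(q, p) - float(q @ p)) < 1e-8
```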
🚀 Quick Start
Here are some examples of using the model with Sentence-Transformers, FlagEmbedding, or Hugging Face Transformers.
✨ Features
- Semantic Search: Ideal for semantic search in French, mapping queries and passages to 768-dimensional dense vectors.
- Multi-Library Support: Can be used with Sentence-Transformers, FlagEmbedding, or Hugging Face Transformers.
📦 Installation
To use this model, install the library you plan to use:
- Sentence-Transformers:
  ```bash
  pip install -U sentence-transformers
  ```
- FlagEmbedding:
  ```bash
  pip install -U FlagEmbedding
  ```
- Hugging Face Transformers:
  ```bash
  pip install -U transformers
  ```
💻 Usage Examples
Basic Usage
Using Sentence-Transformers
```python
from sentence_transformers import SentenceTransformer

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

model = SentenceTransformer('antoinelouis/biencoder-electra-base-french-mmarcoFR')
q_embeddings = model.encode(queries, normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)

similarity = q_embeddings @ p_embeddings.T
print(similarity)
```
Using FlagEmbedding
```python
from FlagEmbedding import FlagModel

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

model = FlagModel('antoinelouis/biencoder-electra-base-french-mmarcoFR')
q_embeddings = model.encode(queries, normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)

similarity = q_embeddings @ p_embeddings.T
print(similarity)
```
Using Transformers
```python
import torch
from torch.nn.functional import normalize
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    """Mean-pool the contextualized token embeddings, ignoring padding tokens."""
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

tokenizer = AutoTokenizer.from_pretrained('antoinelouis/biencoder-electra-base-french-mmarcoFR')
model = AutoModel.from_pretrained('antoinelouis/biencoder-electra-base-french-mmarcoFR')

q_input = tokenizer(queries, padding=True, truncation=True, return_tensors='pt')
p_input = tokenizer(passages, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    q_output = model(**q_input)
    p_output = model(**p_input)

# Mean-pool and L2-normalize so the dot product below equals the cosine similarity.
q_embeddings = normalize(mean_pooling(q_output, q_input['attention_mask']), p=2, dim=1)
p_embeddings = normalize(mean_pooling(p_output, p_input['attention_mask']), p=2, dim=1)

similarity = q_embeddings @ p_embeddings.T
print(similarity)
```
📚 Documentation
Evaluation
The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for a corpus of 8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
To see how it compares to other neural retrievers in French, check out the [DécouvrIR](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
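For reference, here is a rough sketch (not the official evaluation script) of how MRR@k and R@k can be computed from ranked results; it assumes a hypothetical `rankings` dict mapping each query ID to its ranked list of retrieved passage IDs and a `qrels` dict mapping each query ID to its set of relevant passage IDs:

```python
def mrr_at_k(rankings, qrels, k=10):
    """Mean reciprocal rank: average of 1/rank of the first relevant passage in the top k."""
    total = 0.0
    for qid, ranked_pids in rankings.items():
        for rank, pid in enumerate(ranked_pids[:k], start=1):
            if pid in qrels[qid]:
                total += 1.0 / rank
                break
    return total / len(rankings)

def recall_at_k(rankings, qrels, k=100):
    """Recall@k: fraction of relevant passages retrieved in the top k, averaged over queries."""
    total = 0.0
    for qid, ranked_pids in rankings.items():
        relevant = qrels[qid]
        total += len(relevant & set(ranked_pids[:k])) / len(relevant)
    return total / len(rankings)
```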
Training
Data
We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries. We do not use the BM25 negatives provided by the official dataset; instead, we sample harder negatives mined from 12 distinct dense retrievers, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives) distillation dataset.
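The card does not detail the exact sampling procedure, but as a rough illustration, the sketch below shows how (query, positive, hard negative) training triples could be drawn from a pool of negatives mined by several retrievers; all field and identifier names here are hypothetical:

```python
import random

# Hypothetical mined-negatives record: for each query, the passages labelled relevant
# and, per retrieval system, the non-relevant passages that system ranked highly.
mined = {
    "qid_1": {
        "positives": ["pid_12"],
        "negatives": {"system_a": ["pid_7", "pid_9"], "system_b": ["pid_3"]},
    },
}

def sample_triple(qid, record):
    """Draw one (query, positive passage, hard negative passage) training triple."""
    positive = random.choice(record["positives"])
    # Pool the hard negatives mined by all systems and pick one at random.
    pool = [pid for pids in record["negatives"].values() for pid in pids]
    return qid, positive, random.choice(pool)

print(sample_triple("qid_1", mined["qid_1"]))
```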
Implementation
The model is initialized from the [dbmdz/electra-base-french-europeana-cased-discriminator](https://huggingface.co/dbmdz/electra-base-french-europeana-cased-discriminator) checkpoint and optimized via the cross-entropy loss (as in DPR) with a temperature of 0.05. It is fine-tuned on one 32GB NVIDIA V100 GPU for 20 epochs (i.e., 62.4k steps) using the AdamW optimizer with a batch size of 160 and a peak learning rate of 2e-5, with warmup over the first 500 steps and linear scheduling. We set the maximum sequence length to 128 tokens for both questions and passages, and use cosine similarity to compute relevance scores.
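As a minimal sketch of this objective (an illustration under the stated hyperparameters, not the actual training code), the function below computes a DPR-style cross-entropy loss over cosine similarities with in-batch negatives and a temperature of 0.05:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_embeddings: torch.Tensor, p_embeddings: torch.Tensor, temperature: float = 0.05):
    """Cross-entropy over cosine similarities: the positive passage for query i sits at
    batch index i, and every other passage in the batch serves as a negative."""
    q = F.normalize(q_embeddings, p=2, dim=1)
    p = F.normalize(p_embeddings, p=2, dim=1)
    scores = q @ p.T / temperature                       # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)    # positives lie on the diagonal
    return F.cross_entropy(scores, labels)

# Toy usage with random 768-dimensional embeddings.
loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```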
🔧 Technical Details
| Property | Details |
|----------|---------|
| Model Type | Dense single-vector bi-encoder model for French |
| Training Data | French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, with harder negatives from [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives) |
| Initial Checkpoint | [dbmdz/electra-base-french-europeana-cased-discriminator](https://huggingface.co/dbmdz/electra-base-french-europeana-cased-discriminator) |
| Loss Function | Cross-entropy loss with a temperature of 0.05 |
| Optimizer | AdamW |
| Batch Size | 160 |
| Learning Rate | Peak learning rate of 2e-5, with warmup over the first 500 steps and linear scheduling |
| Epochs | 20 epochs (62.4k steps) |
| GPU | One 32GB NVIDIA V100 GPU |
| Maximum Sequence Length | 128 tokens for both questions and passages |
| Relevance Computation | Cosine similarity |
📄 License
This project is licensed under the MIT license.
📖 Citation
```bibtex
@online{louis2024decouvrir,
    author    = {Antoine Louis},
    title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
    publisher = {Hugging Face},
    month     = {mar},
    year      = {2024},
    url       = {https://huggingface.co/spaces/antoinelouis/decouvrir},
}
```