🚀 biencoder-electra-base-french-mmarcoFR
This is a dense single-vector bi-encoder model for French, designed for semantic search. It maps queries and passages to 768-dimensional dense vectors and computes relevance via cosine similarity.
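As a quick illustration of this scoring scheme (a standalone sketch, not part of the model's code), the snippet below checks that cosine similarity over L2-normalized vectors reduces to a plain dot product, which is why the usage examples below normalize the embeddings and then compute `q_embeddings @ p_embeddings.T`:

```python
import numpy as np

def cosine_similarity(q: np.ndarray, p: np.ndarray) -> float:
    """Cosine similarity between a query vector and a passage vector."""
    return float(np.dot(q, p) / (np.linalg.norm(q) * np.linalg.norm(p)))

# Toy 768-dimensional vectors, L2-normalized as the model's embeddings would be.
q = np.random.rand(768); q /= np.linalg.norm(q)
p = np.random.rand(768); p /= np.linalg.norm(p)

# With unit-norm vectors, the dot product equals the cosine similarity.
assert abs(cosine_similarity(q, p) - float(q @ p)) < 1e-8
```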
🚀 Quick Start
Here are some examples of using the model with Sentence-Transformers, FlagEmbedding, or Hugging Face Transformers.
✨ Features
- Semantic Search: Ideal for semantic search in French, mapping queries and passages to 768-dimensional dense vectors.
- Multi-Library Support: Can be used with Sentence-Transformers, FlagEmbedding, or Hugging Face Transformers.
📦 Installation
To use this model, install the library you plan to use:
- Sentence-Transformers:
  ```bash
  pip install -U sentence-transformers
  ```
- FlagEmbedding:
  ```bash
  pip install -U FlagEmbedding
  ```
- Hugging Face Transformers:
  ```bash
  pip install -U transformers
  ```
💻 Usage Examples
Basic Usage
Using Sentence-Transformers
```python
from sentence_transformers import SentenceTransformer

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

model = SentenceTransformer('antoinelouis/biencoder-electra-base-french-mmarcoFR')
q_embeddings = model.encode(queries, normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)

similarity = q_embeddings @ p_embeddings.T
print(similarity)
```
Using FlagEmbedding
```python
from FlagEmbedding import FlagModel

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

model = FlagModel('antoinelouis/biencoder-electra-base-french-mmarcoFR')
q_embeddings = model.encode(queries, normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)

similarity = q_embeddings @ p_embeddings.T
print(similarity)
```
Using Transformers
```python
import torch
from torch.nn.functional import normalize
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    """Mean-pool the contextualized token embeddings, ignoring padding tokens."""
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

tokenizer = AutoTokenizer.from_pretrained('antoinelouis/biencoder-electra-base-french-mmarcoFR')
model = AutoModel.from_pretrained('antoinelouis/biencoder-electra-base-french-mmarcoFR')

q_input = tokenizer(queries, padding=True, truncation=True, return_tensors='pt')
p_input = tokenizer(passages, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    q_output = model(**q_input)
    p_output = model(**p_input)

# Mean-pool and L2-normalize so the dot product below equals the cosine similarity.
q_embeddings = normalize(mean_pooling(q_output, q_input['attention_mask']), p=2, dim=1)
p_embeddings = normalize(mean_pooling(p_output, p_input['attention_mask']), p=2, dim=1)

similarity = q_embeddings @ p_embeddings.T
print(similarity)
```
📚 Documentation
Evaluation
The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for a corpus of 8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
To see how it compares to other neural retrievers in French, check out the [DécouvrIR](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
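For reference, here is a rough sketch (not the official evaluation script) of how MRR@k and R@k can be computed from ranked results; it assumes a hypothetical `rankings` dict mapping each query ID to its ranked list of retrieved passage IDs and a `qrels` dict mapping each query ID to its set of relevant passage IDs:

```python
def mrr_at_k(rankings, qrels, k=10):
    """Mean reciprocal rank: average of 1/rank of the first relevant passage in the top k."""
    total = 0.0
    for qid, ranked_pids in rankings.items():
        for rank, pid in enumerate(ranked_pids[:k], start=1):
            if pid in qrels[qid]:
                total += 1.0 / rank
                break
    return total / len(rankings)

def recall_at_k(rankings, qrels, k=100):
    """Recall@k: fraction of relevant passages retrieved in the top k, averaged over queries."""
    total = 0.0
    for qid, ranked_pids in rankings.items():
        relevant = qrels[qid]
        total += len(relevant & set(ranked_pids[:k])) / len(relevant)
    return total / len(rankings)
```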
Training
Data
We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries. We do not use the BM25 negatives provided by the official dataset; instead, we sample harder negatives mined from 12 distinct dense retrievers, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives) distillation dataset.
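The card does not detail the exact sampling procedure, but as a rough illustration, the sketch below shows how (query, positive, hard negative) training triples could be drawn from a pool of negatives mined by several retrievers; all field and identifier names here are hypothetical:

```python
import random

# Hypothetical mined-negatives record: for each query, the passages labelled relevant
# and, per retrieval system, the non-relevant passages that system ranked highly.
mined = {
    "qid_1": {
        "positives": ["pid_12"],
        "negatives": {"system_a": ["pid_7", "pid_9"], "system_b": ["pid_3"]},
    },
}

def sample_triple(qid, record):
    """Draw one (query, positive passage, hard negative passage) training triple."""
    positive = random.choice(record["positives"])
    # Pool the hard negatives mined by all systems and pick one at random.
    pool = [pid for pids in record["negatives"].values() for pid in pids]
    return qid, positive, random.choice(pool)

print(sample_triple("qid_1", mined["qid_1"]))
```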
Implementation
The model is initialized from the [dbmdz/electra-base-french-europeana-cased-discriminator](https://huggingface.co/dbmdz/electra-base-french-europeana-cased-discriminator) checkpoint and optimized via the cross-entropy loss (as in DPR) with a temperature of 0.05. It is fine-tuned on one 32GB NVIDIA V100 GPU for 20 epochs (i.e., 62.4k steps) using the AdamW optimizer with a batch size of 160 and a peak learning rate of 2e-5, with warmup over the first 500 steps and linear scheduling. We set the maximum sequence length to 128 tokens for both questions and passages, and use cosine similarity to compute relevance scores.
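As a minimal sketch of this objective (an illustration under the stated hyperparameters, not the actual training code), the function below computes a DPR-style cross-entropy loss over cosine similarities with in-batch negatives and a temperature of 0.05:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_embeddings: torch.Tensor, p_embeddings: torch.Tensor, temperature: float = 0.05):
    """Cross-entropy over cosine similarities: the positive passage for query i sits at
    batch index i, and every other passage in the batch serves as a negative."""
    q = F.normalize(q_embeddings, p=2, dim=1)
    p = F.normalize(p_embeddings, p=2, dim=1)
    scores = q @ p.T / temperature                       # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)    # positives lie on the diagonal
    return F.cross_entropy(scores, labels)

# Toy usage with random 768-dimensional embeddings.
loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```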
🔧 Technical Details
| Property | Details |
|----------|---------|
| Model Type | Dense single-vector bi-encoder model for French |
| Training Data | French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, with harder negatives from [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives) |
| Initial Checkpoint | [dbmdz/electra-base-french-europeana-cased-discriminator](https://huggingface.co/dbmdz/electra-base-french-europeana-cased-discriminator) |
| Loss Function | Cross-entropy loss with a temperature of 0.05 |
| Optimizer | AdamW |
| Batch Size | 160 |
| Learning Rate | Peak learning rate of 2e-5, with warmup over the first 500 steps and linear scheduling |
| Epochs | 20 epochs (62.4k steps) |
| GPU | One 32GB NVIDIA V100 GPU |
| Maximum Sequence Length | 128 tokens for both questions and passages |
| Relevance Computation | Cosine similarity |
📄 License
This project is licensed under the MIT license.
📖 Citation
```bibtex
@online{louis2024decouvrir,
    author    = {Antoine Louis},
    title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
    publisher = {Hugging Face},
    month     = {mar},
    year      = {2024},
    url       = {https://huggingface.co/spaces/antoinelouis/decouvrir},
}
```