crossencoder-xlm-roberta-base-mmarcoFR: an open-source French cross-encoder for passage re-ranking in semantic search

Developed by antoinelouis

This is a French cross-encoder model based on XLM-RoBERTa, specifically designed for passage re-ranking tasks in semantic search.

Text Embedding

Safetensors

French

Open Source License: MIT

#French Semantic Re-ranking #Cross-encoder Architecture #Information Retrieval Optimization

Downloads 51

Release Time: 5/3/2024

Model Overview

The model performs cross-attention calculations on question-passage pairs and outputs relevance scores, primarily used for re-ranking results returned by primary retrieval systems in semantic search.

Model Features

Efficient Re-ranking

Capable of efficiently re-ranking results returned by primary retrieval systems to improve search result quality.

Multilingual Support

Initialized from the multilingual XLM-RoBERTa base checkpoint and fine-tuned on French data; the results reported here are for French only.

High Precision

On the mMARCO-fr development set, it achieves a Recall@500 of 96.03% when re-ranking 1,000 candidate passages per query.

Model Capabilities

Text Relevance Scoring

Semantic Search Optimization

Passage Re-ranking

Use Cases

Information Retrieval

Search Engine Result Optimization

Re-rank search engine results to improve the ranking of relevant results

Achieves a recall rate of 96.03% in the top 500 results

Question Answering Systems

Rank candidate answers by relevance in question-answering systems

Achieves a mean reciprocal rank over the top 10 results (MRR@10) of 34.19

🚀 crossencoder-xlm-roberta-base-mmarcoFR

A cross-encoder model for French that performs cross-attention between a question-passage pair and outputs a relevance score, useful as a reranker for semantic search.

🚀 Quick Start

Use the model as a reranker for semantic search. Given a query and a set of potentially relevant passages retrieved by an efficient first-stage retrieval system (e.g., BM25 or a fine-tuned dense single-vector bi-encoder), encode each query-passage pair and sort the passages in decreasing order of relevance according to the model's predicted scores.
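The re-ranking step described above can be sketched independently of any library. In this minimal illustration, `rerank` and `toy_score` are hypothetical names introduced here. `toy_score` is a crude word-overlap stand-in for the model's predicted score, used only so the snippet runs without downloading any weights.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and sort passages by decreasing score."""
    pairs = [(query, passage) for passage in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


def toy_score(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in scorer: fraction of query words found in the passage."""
    scores = []
    for query, passage in pairs:
        query_words = set(query.lower().split())
        passage_words = set(passage.lower().split())
        scores.append(len(query_words & passage_words) / max(len(query_words), 1))
    return scores


# Candidates as they might come back from a first-stage retriever (made-up data).
candidates = [
    "Paris est la capitale de la France.",
    "Le chat dort sur le canapé.",
    "La capitale de la France est une grande ville.",
]
ranked = rerank("capitale de la France", candidates, toy_score, top_k=2)
for passage, score in ranked:
    print(f"{score:.2f}  {passage}")
```

With a real model, `score_fn` would be replaced by a function that returns one score per pair, such as the `predict` method shown in the Sentence-Transformers example.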

💻 Usage Examples

Basic Usage

Here are some examples for using the model with Sentence-Transformers, FlagEmbedding, or Hugging Face Transformers.

Using Sentence-Transformers

Start by installing the library: pip install -U sentence-transformers. Then, you can use the model like this:

from sentence_transformers import CrossEncoder

# Each pair is (query, candidate passage).
pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]

model = CrossEncoder('antoinelouis/crossencoder-xlm-roberta-base-mmarcoFR')
scores = model.predict(pairs)  # one relevance score per pair
print(scores)

Using FlagEmbedding

Start by installing the library: pip install -U FlagEmbedding. Then, you can use the model like this:

from FlagEmbedding import FlagReranker

# Each pair is (query, candidate passage).
pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]

reranker = FlagReranker('antoinelouis/crossencoder-xlm-roberta-base-mmarcoFR')
scores = reranker.compute_score(pairs)  # one relevance score per pair
print(scores)

Using Hugging Face Transformers

Start by installing the library: pip install -U transformers. Then, you can use the model like this:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Each pair is (query, candidate passage).
pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]

tokenizer = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-xlm-roberta-base-mmarcoFR')
model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/crossencoder-xlm-roberta-base-mmarcoFR')
model.eval()

with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    # Raw logits, one per pair; torch.sigmoid(scores) maps them to the 0-1 range.
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
print(scores)

📚 Documentation

Evaluation

The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries. For each query, a set of 1,000 passages containing the positive(s) and ColBERTv2 hard negatives must be reranked. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k). To see how it compares to other neural retrievers in French, check out the DécouvrIR leaderboard.
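As a rough illustration of the two metrics named above, the sketch below computes MRR@k and Recall@k over toy rankings. The function names, the dictionary layout, and the data are assumptions made for this example. The exact evaluation script used for the reported numbers is not given here, so this follows one common convention: per-query recall is the fraction of that query's positives found in the top k.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]], positives: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean over queries of 1/rank of the first relevant passage in the top k (0 if none)."""
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        for rank, passage_id in enumerate(ranked_ids[:k], start=1):
            if passage_id in positives[query_id]:
                total += 1.0 / rank
                break
    return total / len(rankings)


def recall_at_k(rankings: Dict[str, List[str]], positives: Dict[str, Set[str]], k: int) -> float:
    """Mean over queries of the fraction of relevant passages retrieved in the top k."""
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        relevant = positives[query_id]
        total += len(relevant.intersection(ranked_ids[:k])) / len(relevant)
    return total / len(rankings)


# Toy data: passage ids sorted by decreasing model score, and the relevant ids per query.
rankings = {"q1": ["p3", "p1", "p2"], "q2": ["p5", "p4", "p6"]}
positives = {"q1": {"p1"}, "q2": {"p6", "p9"}}

print(mrr_at_k(rankings, positives, k=10))
print(recall_at_k(rankings, positives, k=2))
print(recall_at_k(rankings, positives, k=3))
```

The values in the Results table are expressed as percentages, so a function like these would be multiplied by 100 to compare.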

Training

Data

We use the French training samples from the mMARCO dataset, a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries. We do not use the BM25 negatives provided by the official dataset but instead sample harder negatives mined from 12 distinct dense retrievers, using the msmarco-hard-negatives distillation dataset. In the end, we sample 2.6M training triplets of the form (query, passage, relevance) with a positive-to-negative ratio of 1 (i.e., 50% of the pairs are relevant and 50% are irrelevant).
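To make the (query, passage, relevance) format and the 1:1 ratio concrete, here is a minimal sketch that pairs each positive with one randomly drawn hard negative. The function name, the input dictionaries, and the sampling strategy are assumptions for illustration only; the authors' actual sampling procedure is not described in more detail than the paragraph above.

```python
import random
from typing import Dict, List, Tuple


def build_balanced_samples(
    positives: Dict[str, List[str]],
    hard_negatives: Dict[str, List[str]],
    seed: int = 42,
) -> List[Tuple[str, str, int]]:
    """Return shuffled (query, passage, relevance) triplets with as many 1s as 0s."""
    rng = random.Random(seed)
    samples: List[Tuple[str, str, int]] = []
    for query, positive_passages in positives.items():
        negatives = hard_negatives.get(query, [])
        if not negatives:
            continue  # skip queries with no mined negative so the ratio stays 1:1
        for positive in positive_passages:
            samples.append((query, positive, 1))
            samples.append((query, rng.choice(negatives), 0))
    rng.shuffle(samples)
    return samples


# Toy inputs (made-up ids standing in for query and passage texts).
toy_positives = {"q1": ["p1"], "q2": ["p2", "p3"], "q3": ["p7"]}
toy_hard_negatives = {"q1": ["n1", "n2"], "q2": ["n3"]}

samples = build_balanced_samples(toy_positives, toy_hard_negatives)
print(len(samples), sum(label for _, _, label in samples))
```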

Implementation

The model is initialized from the [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) checkpoint and optimized via the binary cross-entropy loss (as in monoBERT). It is fine-tuned on one 80GB NVIDIA H100 GPU for 20k steps using the AdamW optimizer with a batch size of 128 and a constant learning rate of 2e-5. We set the maximum sequence length of the concatenated question-passage pairs to 256 tokens. We use the sigmoid function to get scores between 0 and 1.
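The sigmoid scoring and binary cross-entropy objective mentioned above can be written out in a few lines of plain Python. This is a generic sketch of the standard formulas, not the training code used for this model; the function names and the example logits are assumptions for illustration.

```python
import math
from typing import Sequence


def sigmoid(logit: float) -> float:
    """Map a raw logit to a score in (0, 1), avoiding overflow for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def bce_with_logits(logits: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy computed directly from logits (numerically stable form)."""
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, labels)
    ]
    return sum(losses) / len(losses)


# Hypothetical logits for three query-passage pairs and their relevance labels.
example_logits = [2.0, -1.0, 0.0]
example_labels = [1, 0, 1]

example_scores = [sigmoid(x) for x in example_logits]
print(example_scores)
print(bce_with_logits(example_logits, example_labels))
```

Because the sigmoid is monotonic, sorting passages by raw logits or by sigmoid scores yields the same ranking; the sigmoid only changes the scale.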

Model Information

Property	Details
Pipeline Tag	text-ranking
Language	French
License	MIT
Datasets	unicamp-dl/mmarco
Metrics	recall
Tags	passage-reranking
Library Name	sentence-transformers
Base Model	FacebookAI/xlm-roberta-base

Results

Model Name	Task Type	Task Name	Dataset Name	Dataset Type	Dataset Config	Dataset Split	Metric Type	Metric Value	Metric Name
crossencoder-xlm-roberta-base-mmarcoFR	text-classification	Passage Reranking	mMARCO-fr	unicamp-dl/mmarco	french	validation	recall_at_500	96.03	Recall@500
crossencoder-xlm-roberta-base-mmarcoFR	text-classification	Passage Reranking	mMARCO-fr	unicamp-dl/mmarco	french	validation	recall_at_100	85.03	Recall@100
crossencoder-xlm-roberta-base-mmarcoFR	text-classification	Passage Reranking	mMARCO-fr	unicamp-dl/mmarco	french	validation	recall_at_10	59.57	Recall@10
crossencoder-xlm-roberta-base-mmarcoFR	text-classification	Passage Reranking	mMARCO-fr	unicamp-dl/mmarco	french	validation	mrr_at_10	34.19	MRR@10

📄 License

This model is released under the MIT license.

📚 Citation

@online{louis2024decouvrir,
	author    = {Antoine Louis},
	title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
	publisher = {Hugging Face},
	month     = {mar},
	year      = {2024},
	url       = {https://huggingface.co/spaces/antoinelouis/decouvrir},
}
