The open-source model msmarco-MiniLM-L12-en-de-v1 - Empowering information retrieval passage re-ranking, free to use

Msmarco MiniLM L12 En De V1

Developed by cross-encoder

An English-German cross-lingual cross-encoder model trained on the MS Marco passage ranking task, suitable for passage re-ranking in information retrieval scenarios.

Text Embedding

Transformers

Supports Multiple LanguagesOpen Source License:Apache-2.0 #Cross-language retrieval #Passage re-ranking #English-German bilingual

Downloads 19.62k

Release Time : 3/2/2022

Model Overview

This model is an English-German cross-lingual cross-encoder for passage re-ranking, trained on the MS Marco passage ranking task, supporting information retrieval scenarios in English and German.

Model Features

Cross-language support

Supports bilingual information retrieval and passage re-ranking in English and German.

High-performance re-ranking

Performs excellently on benchmarks like TREC-DL19 and GermanDPR, significantly outperforming BM25 baselines.

Efficient inference

Can process 900 (query, document) pairs per second on a V100 GPU, suitable for large-scale retrieval scenarios.

Model Capabilities

Text ranking

Cross-language information retrieval

Passage re-ranking

Use Cases

Information retrieval

Search engine result re-ranking

Semantically re-rank results returned by traditional retrieval methods like BM25 to improve relevance.

Achieves NDCG@10 of 72.94 on TREC-DL19 English-English retrieval, significantly outperforming BM25's 45.46.

Cross-language retrieval

Supports ranking of English documents for German queries, or vice versa.

Achieves NDCG@10 of 66.07 on TREC-DL19 German-English retrieval.

🚀 Cross-Encoder for MS MARCO - EN-DE

This is a cross-lingual Cross-Encoder model for EN-DE, specifically designed for passage re-ranking. It was trained on the MS Marco Passage Ranking task.

The model serves well in Information Retrieval. You can refer to SBERT.net Retrieve & Re-rank for more details.

The training code is available in this repository, check train_script.py.

🚀 Quick Start

✨ Features

Cross-lingual support for EN-DE.
Applicable for passage re-ranking in Information Retrieval.

📦 Installation

The model can be used with SentenceTransformers or the transformers library. Make sure you have them installed.

💻 Usage Examples

Basic Usage with SentenceTransformers

When you have SentenceTransformers installed, you can use the model like this:

from sentence_transformers import CrossEncoder
model = CrossEncoder('model_name', max_length=512)
query = 'How many people live in Berlin?'
docs = ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.']
pairs = [(query, doc) for doc in docs]
scores = model.predict(pairs)

Advanced Usage with Transformers

With the transformers library, you can use the model like this:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('model_name')
tokenizer = AutoTokenizer.from_pretrained('model_name')

features = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'],  padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)

🔧 Technical Details

The performance was evaluated on three datasets:

TREC-DL19 EN-EN: The original TREC 2019 Deep Learning Track. Given an English query and 1000 documents (retrieved by BM25 lexical search), rank documents according to their relevance. We compute NDCG@10. BM25 achieves a score of 45.46, and a perfect re-ranker can achieve a score of 95.47.
TREC-DL19 DE-EN: The English queries of TREC-DL19 have been translated by a German native speaker to German. We rank the German queries versus the English passages from the original TREC-DL19 setup. We compute NDCG@10.
GermanDPR DE-DE: The GermanDPR dataset provides German queries and German passages from Wikipedia. We indexed the 2.8 Million paragraphs from German Wikipedia and retrieved for each query the top 100 most relevant passages using BM25 lexical search with Elasticsearch. We compute MRR@10. BM25 achieves a score of 35.85, and a perfect re-ranker can achieve a score of 76.27.

We also check the performance of bi-encoders using the same evaluation: The retrieved documents from BM25 lexical search are re-ranked using query & passage embeddings with cosine-similarity. Bi-Encoders can also be used for end-to-end semantic search.

Model-Name	TREC-DL19 EN-EN	TREC-DL19 DE-EN	GermanDPR DE-DE	Docs / Sec
BM25	45.46	-	35.85	-
Cross-Encoder Re-Rankers
cross-encoder/msmarco-MiniLM-L6-en-de-v1	72.43	65.53	46.77	1600
cross-encoder/msmarco-MiniLM-L12-en-de-v1	72.94	66.07	49.91	900
svalabs/cross-electra-ms-marco-german-uncased (DE only)	-	-	53.67	260
deepset/gbert-base-germandpr-reranking (DE only)	-	-	53.59	260
Bi-Encoders (re-ranking)
sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned	63.38	58.28	37.88	940
sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch	65.51	58.69	38.32	940
svalabs/bi-electra-ms-marco-german-uncased (DE only)	-	-	34.31	450
deepset/gbert-base-germandpr-question_encoder (DE only)	-	-	42.55	450

Note: Docs / Sec gives the number of (query, document) pairs we can re-rank within a second on a V100 GPU.

📄 License

This project is licensed under the Apache-2.0 license.

📚 Documentation

Property	Details
Model Type	Cross-Encoder
Training Data	MS Marco Passage Ranking, GermanDPR
Base Model	microsoft/Multilingual-MiniLM-L12-H384
Pipeline Tag	text-ranking
Library Name	sentence-transformers
Tags	transformers
Language	en, de

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご