msmarco-MiniLM-L6-en-de-v1

Developed by cross-encoder

This is a cross-lingual cross-encoder model for English-German passage re-ranking, trained on the MS MARCO passage ranking task.

Text Embedding

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#English-German Cross-Lingual #Passage Re-ranking #Information Retrieval

Downloads 2,784

Release Time: 3/2/2022

Model Overview

This model re-ranks passages in information retrieval scenarios. Queries and passages can be in English or German, in any combination.

Model Features

Cross-Lingual Support

Supports bilingual query and document matching in English and German, enabling cross-lingual information retrieval.

Efficient Re-ranking

Re-ranks the candidates returned by a first-stage retriever such as BM25, at roughly 1,600 (query, document) pairs per second on a V100 GPU.

High Performance

Outperforms the BM25 baseline on both benchmarks where BM25 was evaluated: TREC-DL19 EN-EN (NDCG@10 of 72.43 vs. 45.46) and GermanDPR DE-DE (MRR@10 of 46.77 vs. 35.85).

Model Capabilities

English-German bilingual text matching

Retrieval result re-ranking

Cross-lingual information retrieval

Use Cases

Information Retrieval

Search Engine Result Optimization

Semantically re-ranks results returned by traditional search engines.

Achieved NDCG@10 of 72.43 on TREC-DL19 EN-EN.

Cross-Lingual Document Retrieval

Retrieves English documents using German queries.

Achieved NDCG@10 of 65.53 on TREC-DL19 DE-EN (German queries, English passages).

🚀 Cross-Encoder for MS MARCO - EN-DE

This is a cross-lingual Cross-Encoder model for EN-DE that can be used for passage re-ranking. It was trained on the MS MARCO Passage Ranking task. The model can be used for Information Retrieval; see SBERT.net Retrieve & Re-rank for details. The training code is available in this repository, see train_script.py.
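The retrieve and re-rank flow can be sketched as two stages. This is a minimal illustration, not the SBERT.net implementation: `lexical_retrieve` is a toy term-overlap stand-in for BM25, and `score_pairs` is a hypothetical callable where a cross-encoder's scoring function would be plugged in. `dummy_scorer` exists only so the sketch runs without downloading a model.

```python
import re
from typing import Callable, List, Sequence, Tuple


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def lexical_retrieve(query: str, corpus: Sequence[str], top_k: int) -> List[str]:
    """Toy first stage (stand-in for BM25): rank passages by shared tokens."""
    query_terms = tokenize(query)
    scored = [(len(query_terms & tokenize(doc)), doc) for doc in corpus]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for overlap, doc in scored[:top_k] if overlap > 0]


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Second stage: score every (query, passage) pair and sort by score."""
    pairs = [(query, doc) for doc in candidates]
    scores = [float(s) for s in score_pairs(pairs)]
    return sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)


def dummy_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical scorer for demonstration only; not a real relevance model."""
    return [1.0 if "population" in doc else 0.0 for _, doc in pairs]


corpus = [
    "Berlin is the capital of Germany.",
    "Berlin has a population of 3,520,031 registered inhabitants.",
    "New York City is famous for the Metropolitan Museum of Art.",
]
query = "How many people live in Berlin?"

candidates = lexical_retrieve(query, corpus, top_k=2)
ranked = rerank(query, candidates, dummy_scorer)
for doc, score in ranked:
    print(f"{score:.1f}  {doc}")
```

The first stage is cheap and recall-oriented. The second stage is more expensive per pair, so it only sees the short candidate list.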

🚀 Quick Start

✨ Features

Cross-lingual support for EN-DE languages.
Suitable for passage re-ranking in Information Retrieval tasks.
Training code is provided in the repository.

📦 Installation

The usage examples below require the sentence-transformers and transformers libraries. Install both with pip:

pip install sentence-transformers transformers

💻 Usage Examples

Basic Usage with SentenceTransformers

When you have SentenceTransformers installed, you can use the model like this:

from sentence_transformers import CrossEncoder

# Load the cross-encoder; inputs longer than 512 tokens are truncated.
model = CrossEncoder('cross-encoder/msmarco-MiniLM-L6-en-de-v1', max_length=512)

query = 'How many people live in Berlin?'
docs = [
    'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
    'New York City is famous for the Metropolitan Museum of Art.',
]

# Score every (query, passage) pair; a higher score means a more relevant passage.
pairs = [(query, doc) for doc in docs]
scores = model.predict(pairs)
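`model.predict` returns one score per pair, in input order. A minimal sketch of turning those scores into a ranking follows. The helper name `rank_documents` is hypothetical, and the score values are made up for illustration; they are not actual model outputs.

```python
def rank_documents(docs, scores):
    """Return (doc, score) tuples sorted from most to least relevant."""
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    return sorted(zip(docs, map(float, scores)), key=lambda pair: pair[1], reverse=True)


docs = [
    "New York City is famous for the Metropolitan Museum of Art.",
    "Berlin has a population of 3,520,031 registered inhabitants.",
]
example_scores = [-4.2, 7.9]  # illustrative values, not real model output

ranking = rank_documents(docs, example_scores)
for doc, score in ranking:
    print(f"{score:6.2f}  {doc}")
```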

Basic Usage with Transformers

With the transformers library, you can use the model like this:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = 'cross-encoder/msmarco-MiniLM-L6-en-de-v1'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

query = 'How many people live in Berlin?'
docs = [
    'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
    'New York City is famous for the Metropolitan Museum of Art.',
]

# Tokenize each (query, passage) pair; the query is repeated once per passage.
features = tokenizer([query] * len(docs), docs, padding=True, truncation=True, return_tensors="pt")

# Run inference without gradient tracking; logits holds the relevance scores.
model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)
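The `logits` tensor has one row per (query, passage) pair. Assuming the model has a single output label, so that the shape is `(n, 1)`, the values can be flattened with something like `scores.squeeze(-1).tolist()` and then sorted. If a value between 0 and 1 is more convenient, a sigmoid can be applied; it is monotonic, so the ranking does not change. The sketch below works on plain Python floats, and the logit values are invented for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


raw_logits = [-4.0, 8.5]  # illustrative values, not real model output
squashed = [sigmoid(x) for x in raw_logits]

# Indices of the passages, best first. Sorting by raw or squashed values is equivalent.
order_raw = sorted(range(len(raw_logits)), key=lambda i: raw_logits[i], reverse=True)
order_squashed = sorted(range(len(squashed)), key=lambda i: squashed[i], reverse=True)
print(order_raw, squashed)
```

The squashed values are not calibrated probabilities; they are only a bounded version of the same scores.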

📚 Documentation

The model's performance was evaluated on three datasets:

TREC-DL19 EN-EN: The original TREC 2019 Deep Learning Track: Given an English query and 1000 documents (retrieved by BM25 lexical search), rank documents according to their relevance. We compute NDCG@10. BM25 achieves a score of 45.46, and a perfect re-ranker can achieve a score of 95.47.
TREC-DL19 DE-EN: The English queries of TREC-DL19 were translated into German by a native speaker. We rank the English passages from the original TREC-DL19 setup against the German queries. We compute NDCG@10.
GermanDPR DE-DE: The GermanDPR dataset provides German queries and German passages from Wikipedia. We indexed the 2.8 million paragraphs from German Wikipedia and, for each query, retrieved the top 100 passages using BM25 lexical search with Elasticsearch. We compute MRR@10. BM25 achieves a score of 35.85, and a perfect re-ranker can achieve a score of 76.27.
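The two metrics used above can be sketched in a few lines. This is a minimal illustration assuming linear gain and a log2 position discount for NDCG; the source does not say which NDCG variant or evaluation tool was used, and the function names are hypothetical. The scores reported in this document correspond to these values multiplied by 100.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2 discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float],
              all_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the produced ranking divided by the DCG of the ideal ranking.

    all_relevances holds the labels of every judged passage for the query,
    including passages that never made it into the candidate list.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


def mrr_at_k(relevance_flags_per_query: Sequence[Sequence[bool]], k: int = 10) -> float:
    """Mean of 1/rank of the first relevant hit in the top k (0 if there is none)."""
    if not relevance_flags_per_query:
        return 0.0
    total = 0.0
    for flags in relevance_flags_per_query:
        for rank, is_relevant in enumerate(flags[:k], start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(relevance_flags_per_query)


perfect = ndcg_at_k([3, 2, 0], [3, 2, 0])
reversed_order = ndcg_at_k([0, 2, 3], [3, 2, 0])
# The candidate list is missing the passage with relevance 2, so even the
# best possible ordering of the candidates stays below 1.0.
upper_bound = ndcg_at_k([3, 1], [3, 2, 1])
mrr = mrr_at_k([[False, True], [True], [False, False]])
print(perfect, reversed_order, upper_bound, mrr)
```

The `upper_bound` case is one plausible reading of why a perfect re-ranker scores below 100 here: a re-ranker can only reorder what BM25 retrieved, so relevant passages missing from the candidate list cap the achievable score.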

We also check the performance of bi-encoders using the same evaluation: The retrieved documents from BM25 lexical search are re-ranked using query & passage embeddings with cosine-similarity. Bi-Encoders can also be used for end-to-end semantic search.
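Bi-encoder re-ranking as described here reduces to sorting passages by cosine similarity to the query embedding. The following is a minimal sketch with hand-written toy vectors; in practice the embeddings would come from a bi-encoder model, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank_by_cosine(query_embedding: Sequence[float],
                     passage_embeddings: Sequence[Sequence[float]]) -> List[int]:
    """Return passage indices sorted by descending similarity to the query."""
    sims = [cosine_similarity(query_embedding, p) for p in passage_embeddings]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)


query_vec = [1.0, 0.0, 1.0]          # toy embedding, not model output
passage_vecs = [
    [0.0, 1.0, 0.0],                 # orthogonal to the query
    [2.0, 0.0, 2.0],                 # same direction as the query
    [1.0, 1.0, 0.0],                 # partially aligned
]
cosine_order = rerank_by_cosine(query_vec, passage_vecs)
print(cosine_order)
```

Unlike a cross-encoder, which has to process every (query, passage) pair jointly, passage embeddings can be computed once and reused across queries.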

Model-Name	TREC-DL19 EN-EN (NDCG@10)	TREC-DL19 DE-EN (NDCG@10)	GermanDPR DE-DE (MRR@10)	Docs / Sec
BM25	45.46	-	35.85	-
Cross-Encoder Re-Rankers
cross-encoder/msmarco-MiniLM-L6-en-de-v1	72.43	65.53	46.77	1600
cross-encoder/msmarco-MiniLM-L12-en-de-v1	72.94	66.07	49.91	900
svalabs/cross-electra-ms-marco-german-uncased (DE only)	-	-	53.67	260
deepset/gbert-base-germandpr-reranking (DE only)	-	-	53.59	260
Bi-Encoders (re-ranking)
sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned	63.38	58.28	37.88	940
sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch	65.51	58.69	38.32	940
svalabs/bi-electra-ms-marco-german-uncased (DE only)	-	-	34.31	450
deepset/gbert-base-germandpr-question_encoder (DE only)	-	-	42.55	450

Note: Docs / Sec gives the number of (query, document) pairs we can re-rank within a second on a V100 GPU.
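The Docs / Sec column gives a rough estimate of re-ranking latency per query. A back-of-the-envelope sketch, assuming throughput stays constant and ignoring first-stage retrieval and other overhead (the function name is hypothetical):

```python
def rerank_latency_seconds(num_candidates: int, docs_per_second: float) -> float:
    """Estimated time to score num_candidates (query, document) pairs."""
    return num_candidates / docs_per_second


# Throughput figures taken from the table above (V100 GPU).
l6_1000 = rerank_latency_seconds(1000, 1600)   # MiniLM-L6, 1000 candidates
l12_1000 = rerank_latency_seconds(1000, 900)   # MiniLM-L12, 1000 candidates
l6_100 = rerank_latency_seconds(100, 1600)     # MiniLM-L6, 100 candidates

print(f"L6,  1000 candidates: {l6_1000:.3f} s")
print(f"L12, 1000 candidates: {l12_1000:.3f} s")
print(f"L6,   100 candidates: {l6_100:.4f} s")
```

Under these assumptions, re-ranking 1000 BM25 candidates takes about 0.6 seconds with the L6 model and about 1.1 seconds with the L12 model.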

📄 License

This project is licensed under the Apache-2.0 license.
