Cross-Encoder for MS MARCO - EN-DE
This is a cross-lingual Cross-Encoder model for EN-DE that can be used for passage re-ranking. It was trained on the MS MARCO Passage Ranking task. The model can be applied in Information Retrieval, as detailed in SBERT.net Retrieve & Re-rank. The training code is available in this repository; see train_script.py.
Quick Start
Features
- Cross-lingual support for English and German (EN-DE).
- Suitable for passage re-ranking in Information Retrieval tasks.
- Training code is provided in the repository.
Installation
The examples below require the sentence-transformers and transformers libraries. You can install them using pip:
pip install sentence-transformers transformers
Usage Examples
Basic Usage with SentenceTransformers
When you have SentenceTransformers installed, you can use the model like this:
from sentence_transformers import CrossEncoder

# 'model_name' is a placeholder for the identifier of this model.
model = CrossEncoder('model_name', max_length=512)
query = 'How many people live in Berlin?'
docs = [
    'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
    'New York City is famous for the Metropolitan Museum of Art.',
]
# Build one (query, passage) pair per passage; predict returns one relevance score per pair.
pairs = [(query, doc) for doc in docs]
scores = model.predict(pairs)
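To turn the scores into a ranking, the passages can be sorted by score in descending order. The following is a minimal sketch of that step only. The helper name rank_passages and the example scores are hypothetical and chosen for illustration; they are not outputs of this model.

```python
def rank_passages(docs, scores):
    """Return (score, passage) tuples sorted from most to least relevant."""
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    # Pair each passage with its score and sort by score, highest first.
    return sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)


# Hypothetical scores for illustration only; real values would come from model.predict(pairs).
example_docs = ["passage about Berlin", "passage about New York City"]
example_scores = [0.2, 7.5]
ranked = rank_passages(example_docs, example_scores)
for score, doc in ranked:
    print(f"{score:.2f}\t{doc}")
```

In a retrieve and re-rank setup, the same sorting would be applied to the candidate passages returned by the first-stage retriever.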
Basic Usage with Transformers
With the transformers library, you can use the model like this:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# 'model_name' is a placeholder for the identifier of this model.
model = AutoModelForSequenceClassification.from_pretrained('model_name')
tokenizer = AutoTokenizer.from_pretrained('model_name')

# The query is repeated so that each passage is paired with it.
queries = ['How many people live in Berlin?', 'How many people live in Berlin?']
passages = [
    'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
    'New York City is famous for the Metropolitan Museum of Art.',
]
features = tokenizer(queries, passages, padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)
Documentation
The model's performance was evaluated on three datasets:
- TREC-DL19 EN-EN: The original TREC 2019 Deep Learning Track: Given an English query and 1000 documents (retrieved by BM25 lexical search), rank documents according to their relevance. We compute NDCG@10. BM25 achieves a score of 45.46, and a perfect re-ranker can achieve a score of 95.47.
- TREC-DL19 DE-EN: The English queries of TREC-DL19 have been translated by a German native speaker to German. We rank the German queries versus the English passages from the original TREC-DL19 setup. We compute NDCG@10.
- GermanDPR DE-DE: The GermanDPR dataset provides German queries and German passages from Wikipedia. We indexed the 2.8 million paragraphs from German Wikipedia and, for each query, retrieved the 100 most relevant passages using BM25 lexical search with Elasticsearch. We compute MRR@10. BM25 achieves a score of 35.85, and a perfect re-ranker can achieve a score of 76.27.
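The two metrics named above can be sketched as follows. This is an illustrative implementation, not the evaluation code used for the reported numbers. It assumes linear gain for NDCG (the relevance label divided by log2(rank + 1)), and all function and parameter names are hypothetical. The optional ideal_relevances argument stands for the labels of all judged passages of a query; when the candidate list from BM25 misses relevant passages, even a perfect re-ranker stays below the maximum, which is consistent with the upper bounds quoted above.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances, k=10, ideal_relevances=None):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering.

    ideal_relevances (assumed input): labels of all judged passages for the
    query. If omitted, the ideal ordering is built from the ranked list itself.
    """
    pool = ranked_relevances if ideal_relevances is None else ideal_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def mrr_at_k(rankings, k=10):
    """MRR@k: mean over queries of 1/rank of the first relevant passage in the top k."""
    total = 0.0
    for relevances in rankings:
        for rank, rel in enumerate(relevances[:k], start=1):
            if rel > 0:
                total += 1.0 / rank
                break  # only the first relevant passage counts
    return total / len(rankings) if rankings else 0.0
```

Both functions return values between 0 and 1; the scores in this section are on a 0 to 100 scale, which corresponds to multiplying by 100.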
We also check the performance of bi-encoders using the same evaluation: The retrieved documents from BM25 lexical search are re-ranked using query & passage embeddings with cosine-similarity. Bi-Encoders can also be used for end-to-end semantic search.
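The bi-encoder re-ranking step described above can be sketched in plain Python. The sketch assumes the query and passage embeddings have already been computed by some bi-encoder; the toy three-dimensional vectors and the function names are hypothetical and only illustrate the cosine-similarity ordering.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank_by_cosine(query_embedding, passage_embeddings):
    """Return passage indices ordered from most to least similar to the query."""
    sims = [cosine_similarity(query_embedding, p) for p in passage_embeddings]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)


# Toy embeddings for illustration only; real ones would come from a bi-encoder.
query_vec = [1.0, 0.0, 1.0]
passage_vecs = [[0.0, 1.0, 0.0], [1.0, 0.1, 0.9], [0.5, 0.5, 0.5]]
order = rerank_by_cosine(query_vec, passage_vecs)
print(order)
```

Unlike a Cross-Encoder, which scores each (query, passage) pair jointly, this approach embeds the query and the passages independently, so passage embeddings can be computed once and reused across queries.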
Note: Docs / Sec gives the number of (query, document) pairs we can re-rank within a second on a V100 GPU.
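A throughput figure of this kind can be estimated by timing a scoring function over a fixed set of pairs. The sketch below is a hedged illustration and not the procedure used for the reported numbers. The helper pairs_per_second, the batch size, and the stand-in scorer are all assumptions; in practice score_fn could be a function that wraps model.predict.

```python
import time


def pairs_per_second(score_fn, pairs, batch_size=32):
    """Score all pairs in batches and return the number of pairs scored per second."""
    if not pairs:
        return 0.0
    start = time.perf_counter()
    for i in range(0, len(pairs), batch_size):
        score_fn(pairs[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(pairs) / elapsed if elapsed > 0 else float("inf")


def dummy_score_fn(batch):
    """Stand-in scorer (hypothetical); replace with a real model call."""
    return [float(len(query) + len(passage)) for query, passage in batch]


pairs = [("query", "passage")] * 1000
rate = pairs_per_second(dummy_score_fn, pairs)
print(f"{rate:.0f} pairs / sec")
```

When timing a real model on a GPU, a warm-up run before measuring is advisable, since the first batches often include one-time setup costs.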
License
This project is licensed under the Apache-2.0 license.