PM-AI/bi-encoder_msmarco_bert-base_german
This model is designed for semantic search and document retrieval, enabling users to find relevant passages for a given query. It was trained on a machine-translated German version of the MSMARCO dataset, leveraging hard negatives and Margin MSE loss to achieve state-of-the-art performance in asymmetric search.
🚀 Quick Start
The model can be easily used with the Sentence Transformers library.
✨ Features
- Semantic Search: Capable of performing semantic search to find relevant passages for a given query.
- Document Retrieval: Efficiently retrieves documents based on the input query.
- SOTA Performance: Achieves state-of-the-art results in asymmetric search through a combination of hard negatives and Margin MSE loss.
📦 Installation
No model-specific installation steps are required; the model only needs the sentence-transformers package (`pip install sentence-transformers`).
💻 Usage Examples
Basic Usage
The model is used in conjunction with the Sentence Transformers library. Although no basic usage code is provided in the original model card, here is a general example:
```python
from sentence_transformers import SentenceTransformer

# load the bi-encoder from the Hugging Face Hub
model = SentenceTransformer('PM-AI/bi-encoder_msmarco_bert-base_german')

query = "Your query here"
passages = ["Passage 1", "Passage 2"]

# encode query and passages into the shared vector space
query_embedding = model.encode(query)
passage_embeddings = model.encode(passages)
```
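To actually rank the passages, the embeddings need to be compared with a similarity function. Below is a minimal sketch using the `util` helpers from Sentence Transformers; choosing `dot_score` is an assumption based on the Margin MSE training described later, not something stated in the original card:

```python
from sentence_transformers import util

# score every passage against the query; higher = more relevant
scores = util.dot_score(query_embedding, passage_embeddings)
best = scores.argmax().item()
print(f"Best passage: {passages[best]} (score: {scores[0][best]:.4f})")
```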
Advanced Usage
The custom-made script [mmarco_beir.py](https://huggingface.co/PM-AI/bi-encoder_msmarco_bert-base_german/blob/main/mmarco_beir.py) contains all necessary adaptations for BEIR compatibility. Here is the code:
```python
# mmarco_beir.py
import json
import os
import urllib.request

import datasets

# see https://huggingface.co/datasets/unicamp-dl/mmarco for supported languages
LANGUAGE = "german"

# target directory containing BEIR (https://github.com/beir-cellar/beir) compatible files
OUT_DIR = f"mmarco-google/{LANGUAGE}/"
os.makedirs(OUT_DIR, exist_ok=True)

# download the Google-translated collection/corpus of MSMARCO and write corpus.jsonl for BEIR compatibility
mmarco_ds = datasets.load_dataset("unicamp-dl/mmarco", f"collection-{LANGUAGE}")
with open(os.path.join(OUT_DIR, "corpus.jsonl"), "w", encoding="utf-8") as out_file:
    for entry in mmarco_ds["collection"]:
        entry = {"_id": str(entry["id"]), "title": "", "text": entry["text"]}
        out_file.write(f'{json.dumps(entry, ensure_ascii=False)}\n')

# download the Google-translated queries of MSMARCO and write queries.jsonl for BEIR compatibility
mmarco_ds = datasets.load_dataset("unicamp-dl/mmarco", f"queries-{LANGUAGE}")
mmarco_ds = datasets.concatenate_datasets([mmarco_ds["train"], mmarco_ds["dev.full"]])
with open(os.path.join(OUT_DIR, "queries.jsonl"), "w", encoding="utf-8") as out_file:
    for entry in mmarco_ds:
        entry = {"_id": str(entry["id"]), "text": entry["text"]}
        out_file.write(f'{json.dumps(entry, ensure_ascii=False)}\n')

QRELS_DIR = os.path.abspath(os.path.join(OUT_DIR, "../qrels/"))
os.makedirs(QRELS_DIR, exist_ok=True)

# download qrels from URL instead of HF dataset
# note: qrels are language independent
for link in ["https://huggingface.co/datasets/BeIR/msmarco-qrels/resolve/main/dev.tsv",
             "https://huggingface.co/datasets/BeIR/msmarco-qrels/resolve/main/train.tsv"]:
    urllib.request.urlretrieve(link, os.path.join(QRELS_DIR, os.path.basename(link)))
```
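Once the script has run, the generated files can be loaded with BEIR's standard data loader. A minimal sketch, assuming the directory layout produced by `OUT_DIR`/`QRELS_DIR` above:

```python
from beir.datasets.data_loader import GenericDataLoader

# load the BEIR-compatible files produced by mmarco_beir.py
corpus, queries, qrels = GenericDataLoader(
    corpus_file="mmarco-google/german/corpus.jsonl",
    query_file="mmarco-google/german/queries.jsonl",
    qrels_file="mmarco-google/qrels/train.tsv",
).load_custom()
print(len(corpus), len(queries), len(qrels))
```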
📚 Documentation
Model summary
This model can be used for semantic search and document retrieval to find relevant passages for a given query. It was trained on a machine-translated German MSMARCO dataset with hard negatives and Margin MSE loss, resulting in a SOTA transformer for asymmetric search.
Training Data
The model is trained on samples from the MSMARCO Passage Ranking dataset, which contains about 500,000 questions and 8.8 million passages. The original dataset is in English and has been machine-translated into other languages; here, the German translation from "mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset" is used.
BEIR requires a specific structure for training data:
Property | Details |
---|---|
`corpus.jsonl` | Contains one JSON string per line with `_id`, `title` and `text`. Example: `{"_id": "1234", "title": "", "text": "some text"}` |
`queries.jsonl` | Each JSON string per line requires an `_id` and a `text`. Example: `{"_id": "5678", "text": "a question?"}` |
`qrels/dev.tsv` | Relates a question (`query-id`) to its correct answer (`corpus-id`). The `score` column is mandatory and always 1. Example: `1234 5678 1` |
`qrels/train.tsv` | Has the same structure as `dev.tsv` |
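For illustration, such a qrels file maps onto the nested `{query_id: {corpus_id: score}}` dict that BEIR uses internally. A minimal sketch, assuming the file has a `query-id`/`corpus-id`/`score` header row as in the BeIR/msmarco-qrels files:

```python
import csv

# read qrels/train.tsv into the nested dict layout BEIR expects
qrels: dict[str, dict[str, int]] = {}
with open("mmarco-google/qrels/train.tsv", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)  # skip the header row: query-id, corpus-id, score
    for query_id, corpus_id, score in reader:
        qrels.setdefault(query_id, {})[corpus_id] = int(score)
```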
Training
The training is run using the [BEIR Benchmark Framework](https://github.com/beir-cellar/beir). The model is trained on the MSMARCO dataset with the Margin MSE loss method, using "hard negatives".
Parameterization of training
Property | Details |
---|---|
Script | [train_msmarco_v3_margin_MSE.py](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/training/train_msmarco_v3_margin_MSE.py) |
Dataset | mmarco (compatibility established using [mmarco_beir.py](https://huggingface.co/PM-AI/bi-encoder_msmarco_bert-base_german/blob/main/mmarco_beir.py)), train split |
GPU | NVIDIA A40 (Driver Version: 515.48.07; CUDA Version: 11.7) |
Batch Size | 75 |
Max. Sequence Length | 350 |
Base Model | [deepset/gbert-base](https://huggingface.co/deepset/gbert-base) |
Loss Function | Margin MSE |
Epochs | 10 |
Evaluation Steps | 10000 |
Warmup Steps | 1000 |
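For illustration, the table above translates into roughly the following Sentence Transformers setup. This is a minimal sketch with a toy training sample; the actual run used BEIR's train_msmarco_v3_margin_MSE.py, which additionally handles hard-negative sampling and evaluation:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

# bi-encoder built from the German base model, with the max. sequence length from the table
word_embedding = models.Transformer("deepset/gbert-base", max_seq_length=350)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# toy sample: (query, positive passage, hard negative), labeled with the
# cross-encoder score margin CEScore(query, pos) - CEScore(query, neg)
train_samples = [InputExample(texts=["a question?", "relevant passage", "hard negative passage"], label=4.5)]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=75)
train_loss = losses.MarginMSELoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    warmup_steps=1000,
    evaluation_steps=10000,
)
```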
Evaluation
The evaluation is based on germanDPR. The BEIR-compatible germanDPR dataset consists of 9,275 questions and 23,993 text passages for the train split. The following table shows the evaluation results:
Model | NDCG@1 | NDCG@10 | NDCG@100 | Comment |
---|---|---|---|---|
bi-encoder_msmarco_bert-base_german (new) | 0.5300 🏆 | 0.7196 🏆 | 0.7360 🏆 | "OUR model" |
[deepset/gbert-base-germandpr-X_encoder](https://huggingface.co/deepset/gbert-base-germandpr-ctx_encoder) | 0.4828 | 0.6970 | 0.7147 | "has two encoder models (one for queries and one for corpus), is SOTA approach" |
[distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) | 0.4561 | 0.6347 | 0.6613 | "trained on 15 languages" |
[paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 0.4511 | 0.6328 | 0.6592 | "trained on huge corpus, support for 50+ languages" |
[distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 0.4350 | 0.6103 | 0.6411 | "trained on 50+ languages" |
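Such an evaluation can be reproduced with BEIR's standard dense-retrieval pipeline. A minimal sketch, assuming `corpus`, `queries` and `qrels` have been loaded as shown earlier (the germanDPR conversion itself is not reproduced here):

```python
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# wrap the bi-encoder for exact (brute-force) dense retrieval with dot-product scoring
dense_model = DRES(models.SentenceBERT("PM-AI/bi-encoder_msmarco_bert-base_german"), batch_size=128)
retriever = EvaluateRetrieval(dense_model, score_function="dot")

# retrieve passages and compute NDCG@k, MAP@k, Recall@k, Precision@k
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)
```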
🔧 Technical Details
Hard Negatives
We use the MSMARCO Hard Negatives File provided by Nils Reimers: https://sbert.net/datasets/msmarco-hard-negatives.jsonl.gz. Negative passages are hard negative examples that were mined using different dense embedding, cross-encoder and lexical search methods. The file contains up to 50 negatives for each of four retrieval systems: bm25, msmarco-distilbert-base-tas-b, msmarco-MiniLM-L-6-v3 and msmarco-distilbert-base-v3. Each positive and negative passage comes with a score from a Cross-Encoder (msmarco-MiniLM-L-6-v3). This allows denoising, i.e., removing false negative passages that are actually relevant for the query.
[Source](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/training/train_msmarco_v3_margin_MSE.py)
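As an illustration of this denoising step, here is a minimal sketch. The entry schema (`qid`/`pos`/`neg` with per-passage `ce-score` fields) and the margin of 3.0 are assumptions for illustration and may not match the file exactly:

```python
import gzip
import json

CE_SCORE_MARGIN = 3.0  # assumed threshold; negatives scoring too close to a positive are dropped

with gzip.open("msmarco-hard-negatives.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        # lowest cross-encoder score among the annotated positive passages
        pos_min_score = min(p["ce-score"] for p in entry["pos"])
        for system, negatives in entry["neg"].items():
            # keep only "true" negatives: passages the cross-encoder scores clearly below the positives
            entry["neg"][system] = [n for n in negatives
                                    if n["ce-score"] < pos_min_score - CE_SCORE_MARGIN]
```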
MarginMSELoss
MarginMSELoss is based on the paper of Hofstätter et al. As with MultipleNegativesRankingLoss, we have triplets: (query, passage1, passage2). In contrast to MultipleNegativesRankingLoss, passage1 and passage2 do not have to be strictly positive/negative; both can be relevant or irrelevant for a given query.

We then compute the Cross-Encoder score for (query, passage1) and (query, passage2). We provide scores for 160 million such pairs in our msmarco-hard-negatives dataset. We then compute the distance: CE_distance = CEScore(query, passage1) - CEScore(query, passage2).

For our bi-encoder training, we encode query, passage1 and passage2 into vector spaces and then measure the dot product between (query, passage1) and (query, passage2). Again, we measure the distance: BE_distance = DotScore(query, passage1) - DotScore(query, passage2). We then want to ensure that the distance predicted by the bi-encoder is close to the distance predicted by the cross-encoder, i.e., we optimize the mean squared error (MSE) between CE_distance and BE_distance.

An advantage of MarginMSELoss compared to MultipleNegativesRankingLoss is that we don't require a positive and a negative passage. As mentioned before, MS MARCO is redundant, and many passages contain the same or similar content. With MarginMSELoss we can train on two relevant passages without issues: in that case, the CE_distance will be smaller and we expect our bi-encoder to also put both passages closer together in the vector space.

A disadvantage of MarginMSELoss is the slower training time: we need many more epochs to get good results. With MultipleNegativesRankingLoss and a batch size of 64, we compare one query against 128 passages. With MarginMSELoss, we compare a query against only two passages.
[Source](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/README.md)
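A toy numerical sketch of the objective described above; all scores and embeddings are made-up values, purely for illustration:

```python
import torch

# made-up cross-encoder (teacher) scores for (query, passage1) and (query, passage2)
ce_score_p1, ce_score_p2 = 8.2, 3.1
ce_distance = ce_score_p1 - ce_score_p2  # teacher margin: 5.1

# made-up bi-encoder embeddings for query, passage1 and passage2
torch.manual_seed(0)
q, p1, p2 = torch.randn(3, 768)

# student margin from dot-product scores
be_distance = torch.dot(q, p1) - torch.dot(q, p2)

# the training objective: MSE between teacher margin and student margin
loss = (ce_distance - be_distance) ** 2
print(loss.item())
```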
📄 License
The model is released under the MIT license.