PM-AI/bi-encoder_msmarco_bert-base_german
This model is designed for semantic search and document retrieval, enabling users to find relevant passages for a given query. It was trained on a machine-translated German version of the MSMARCO dataset, leveraging hard negatives and Margin MSE loss to achieve state-of-the-art performance in asymmetric search.
🚀 Quick Start
The model can be easily used with the Sentence Transformers library.
✨ Features
- Semantic Search: Capable of performing semantic search to find relevant passages for a given query.
- Document Retrieval: Efficiently retrieves documents based on the input query.
- SOTA Performance: Achieves state-of-the-art results in asymmetric search through a combination of hard negatives and Margin MSE loss.
📦 Installation
No model-specific installation steps are required; the model only needs the sentence-transformers package (`pip install sentence-transformers`).
💻 Usage Examples
Basic Usage
The model is used in conjunction with the Sentence Transformers library. Although no basic usage code is provided in the original model card, here is a general example:
```python
from sentence_transformers import SentenceTransformer

# load the bi-encoder from the Hugging Face Hub
model = SentenceTransformer('PM-AI/bi-encoder_msmarco_bert-base_german')

query = "Your query here"
passages = ["Passage 1", "Passage 2"]

# encode query and passages into the shared vector space
query_embedding = model.encode(query)
passage_embeddings = model.encode(passages)
```
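To actually rank the passages, the embeddings need to be compared with a similarity function. Below is a minimal sketch using the `util` helpers from Sentence Transformers; choosing `dot_score` is an assumption based on the Margin MSE training described later, not something stated in the original card:

```python
from sentence_transformers import util

# score every passage against the query; higher = more relevant
scores = util.dot_score(query_embedding, passage_embeddings)
best = scores.argmax().item()
print(f"Best passage: {passages[best]} (score: {scores[0][best]:.4f})")
```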
Advanced Usage
The custom-made script [mmarco_beir.py](https://huggingface.co/PM-AI/bi-encoder_msmarco_bert-base_german/blob/main/mmarco_beir.py) contains all necessary adaptations for BEIR compatibility. Here is the code:
```python
# mmarco_beir.py
import json
import os
import urllib.request

import datasets

# see https://huggingface.co/datasets/unicamp-dl/mmarco for supported languages
LANGUAGE = "german"

# target directory containing BEIR (https://github.com/beir-cellar/beir) compatible files
OUT_DIR = f"mmarco-google/{LANGUAGE}/"
os.makedirs(OUT_DIR, exist_ok=True)

# download the Google-translated collection/corpus of MSMARCO and write corpus.jsonl for BEIR compatibility
mmarco_ds = datasets.load_dataset("unicamp-dl/mmarco", f"collection-{LANGUAGE}")
with open(os.path.join(OUT_DIR, "corpus.jsonl"), "w", encoding="utf-8") as out_file:
    for entry in mmarco_ds["collection"]:
        entry = {"_id": str(entry["id"]), "title": "", "text": entry["text"]}
        out_file.write(f'{json.dumps(entry, ensure_ascii=False)}\n')

# download the Google-translated queries of MSMARCO and write queries.jsonl for BEIR compatibility
mmarco_ds = datasets.load_dataset("unicamp-dl/mmarco", f"queries-{LANGUAGE}")
mmarco_ds = datasets.concatenate_datasets([mmarco_ds["train"], mmarco_ds["dev.full"]])
with open(os.path.join(OUT_DIR, "queries.jsonl"), "w", encoding="utf-8") as out_file:
    for entry in mmarco_ds:
        entry = {"_id": str(entry["id"]), "text": entry["text"]}
        out_file.write(f'{json.dumps(entry, ensure_ascii=False)}\n')

QRELS_DIR = os.path.abspath(os.path.join(OUT_DIR, "../qrels/"))
os.makedirs(QRELS_DIR, exist_ok=True)

# download qrels from URL instead of HF dataset
# note: qrels are language independent
for link in ["https://huggingface.co/datasets/BeIR/msmarco-qrels/resolve/main/dev.tsv",
             "https://huggingface.co/datasets/BeIR/msmarco-qrels/resolve/main/train.tsv"]:
    urllib.request.urlretrieve(link, os.path.join(QRELS_DIR, os.path.basename(link)))
```
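Once the script has run, the generated files can be loaded with BEIR's standard data loader. A minimal sketch, assuming the directory layout produced by `OUT_DIR`/`QRELS_DIR` above:

```python
from beir.datasets.data_loader import GenericDataLoader

# load the BEIR-compatible files produced by mmarco_beir.py
corpus, queries, qrels = GenericDataLoader(
    corpus_file="mmarco-google/german/corpus.jsonl",
    query_file="mmarco-google/german/queries.jsonl",
    qrels_file="mmarco-google/qrels/train.tsv",
).load_custom()
print(len(corpus), len(queries), len(qrels))
```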
📚 Documentation
Model summary
This model can be used for semantic search and document retrieval to find relevant passages for a given query. It was trained on a machine-translated German MSMARCO dataset with hard negatives and Margin MSE loss, resulting in a SOTA transformer for asymmetric search.
Training Data
The model is trained on samples from the MSMARCO Passage Ranking dataset, which contains about 500,000 questions and 8.8 million passages. The original dataset is in English and has been machine-translated into other languages; here, the German translation from "mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset" is used.
BEIR requires a specific structure for training data:
Property | Details |
---|---|
`corpus.jsonl` | Contains one JSON string per line with `_id`, `title` and `text`. Example: `{"_id": "1234", "title": "", "text": "some text"}` |
`queries.jsonl` | Each JSON string per line requires an `_id` and a `text`. Example: `{"_id": "5678", "text": "a question?"}` |
`qrels/dev.tsv` | Relates a question (`query-id`) to its correct answer (`corpus-id`). The `score` column is mandatory and always 1. Example: `1234 5678 1` |
`qrels/train.tsv` | Has the same structure as `dev.tsv` |
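For illustration, such a qrels file maps onto the nested `{query_id: {corpus_id: score}}` dict that BEIR uses internally. A minimal sketch, assuming the file has a `query-id`/`corpus-id`/`score` header row as in the BeIR/msmarco-qrels files:

```python
import csv

# read qrels/train.tsv into the nested dict layout BEIR expects
qrels: dict[str, dict[str, int]] = {}
with open("mmarco-google/qrels/train.tsv", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)  # skip the header row: query-id, corpus-id, score
    for query_id, corpus_id, score in reader:
        qrels.setdefault(query_id, {})[corpus_id] = int(score)
```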
Training
The training is run using the [BEIR Benchmark Framework](https://github.com/beir-cellar/beir). The model is trained on the MSMARCO dataset with the Margin MSE loss method, using "hard negatives".
Parameterization of training
Property | Details |
---|---|
Script | [train_msmarco_v3_margin_MSE.py](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/training/train_msmarco_v3_margin_MSE.py) |
Dataset | mmarco (compatibility established using [mmarco_beir.py](https://huggingface.co/PM-AI/bi-encoder_msmarco_bert-base_german/blob/main/mmarco_beir.py)), train split |
GPU | NVIDIA A40 (Driver Version: 515.48.07; CUDA Version: 11.7) |
Batch Size | 75 |
Max. Sequence Length | 350 |
Base Model | [deepset/gbert-base](https://huggingface.co/deepset/gbert-base) |
Loss Function | Margin MSE |
Epochs | 10 |
Evaluation Steps | 10000 |
Warmup Steps | 1000 |
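For illustration, the table above translates into roughly the following Sentence Transformers setup. This is a minimal sketch with a toy training sample; the actual run used BEIR's train_msmarco_v3_margin_MSE.py, which additionally handles hard-negative sampling and evaluation:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

# bi-encoder built from the German base model, with the max. sequence length from the table
word_embedding = models.Transformer("deepset/gbert-base", max_seq_length=350)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# toy sample: (query, positive passage, hard negative), labeled with the
# cross-encoder score margin CEScore(query, pos) - CEScore(query, neg)
train_samples = [InputExample(texts=["a question?", "relevant passage", "hard negative passage"], label=4.5)]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=75)
train_loss = losses.MarginMSELoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    warmup_steps=1000,
    evaluation_steps=10000,
)
```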
Evaluation
The evaluation is based on germanDPR. The BEIR-compatible germanDPR dataset consists of 9,275 questions and 23,993 text passages for the train split. The following table shows the evaluation results:
Model | NDCG@1 | NDCG@10 | NDCG@100 | Comment |
---|---|---|---|---|
bi-encoder_msmarco_bert-base_german (new) | 0.5300 🏆 | 0.7196 🏆 | 0.7360 🏆 | "OUR model" |
[deepset/gbert-base-germandpr-X_encoder](https://huggingface.co/deepset/gbert-base-germandpr-ctx_encoder) | 0.4828 | 0.6970 | 0.7147 | "has two encoder models (one for queries and one for corpus), is SOTA approach" |
[distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) | 0.4561 | 0.6347 | 0.6613 | "trained on 15 languages" |
[paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 0.4511 | 0.6328 | 0.6592 | "trained on huge corpus, support for 50+ languages" |
[distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 0.4350 | 0.6103 | 0.6411 | "trained on 50+ languages" |
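Such an evaluation can be reproduced with BEIR's standard dense-retrieval pipeline. A minimal sketch, assuming `corpus`, `queries` and `qrels` have been loaded as shown earlier (the germanDPR conversion itself is not reproduced here):

```python
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# wrap the bi-encoder for exact (brute-force) dense retrieval with dot-product scoring
dense_model = DRES(models.SentenceBERT("PM-AI/bi-encoder_msmarco_bert-base_german"), batch_size=128)
retriever = EvaluateRetrieval(dense_model, score_function="dot")

# retrieve passages and compute NDCG@k, MAP@k, Recall@k, Precision@k
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)
```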
🔧 Technical Details
Hard Negatives
We use the MSMARCO Hard Negatives File provided by Nils Reimers: https://sbert.net/datasets/msmarco-hard-negatives.jsonl.gz. Negative passages are hard negative examples that were mined using different dense embedding, cross-encoder and lexical search methods. The file contains up to 50 negatives for each of four retrieval systems: bm25, msmarco-distilbert-base-tas-b, msmarco-MiniLM-L-6-v3 and msmarco-distilbert-base-v3. Each positive and negative passage comes with a score from a Cross-Encoder (msmarco-MiniLM-L-6-v3). This allows denoising, i.e., removing false negative passages that are actually relevant for the query.
[Source](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/training/train_msmarco_v3_margin_MSE.py)
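As an illustration of this denoising step, here is a minimal sketch. The entry schema (`qid`/`pos`/`neg` with per-passage `ce-score` fields) and the margin of 3.0 are assumptions for illustration and may not match the file exactly:

```python
import gzip
import json

CE_SCORE_MARGIN = 3.0  # assumed threshold; negatives scoring too close to a positive are dropped

with gzip.open("msmarco-hard-negatives.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        # lowest cross-encoder score among the annotated positive passages
        pos_min_score = min(p["ce-score"] for p in entry["pos"])
        for system, negatives in entry["neg"].items():
            # keep only "true" negatives: passages the cross-encoder scores clearly below the positives
            entry["neg"][system] = [n for n in negatives
                                    if n["ce-score"] < pos_min_score - CE_SCORE_MARGIN]
```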
MarginMSELoss
MarginMSELoss is based on the paper of Hofstätter et al. As with MultipleNegativesRankingLoss, we have triplets: (query, passage1, passage2). In contrast to MultipleNegativesRankingLoss, passage1 and passage2 do not have to be strictly positive/negative; both can be relevant or irrelevant for a given query.

We then compute the Cross-Encoder score for (query, passage1) and (query, passage2). We provide scores for 160 million such pairs in our msmarco-hard-negatives dataset. We then compute the distance: CE_distance = CEScore(query, passage1) - CEScore(query, passage2).

For our bi-encoder training, we encode query, passage1 and passage2 into vector spaces and then measure the dot product between (query, passage1) and (query, passage2). Again, we measure the distance: BE_distance = DotScore(query, passage1) - DotScore(query, passage2). We then want to ensure that the distance predicted by the bi-encoder is close to the distance predicted by the cross-encoder, i.e., we optimize the mean squared error (MSE) between CE_distance and BE_distance.

An advantage of MarginMSELoss compared to MultipleNegativesRankingLoss is that we don't require a positive and a negative passage. As mentioned before, MS MARCO is redundant, and many passages contain the same or similar content. With MarginMSELoss we can train on two relevant passages without issues: in that case, the CE_distance will be smaller and we expect our bi-encoder to also put both passages closer together in the vector space.

A disadvantage of MarginMSELoss is the slower training time: we need many more epochs to get good results. With MultipleNegativesRankingLoss and a batch size of 64, we compare one query against 128 passages. With MarginMSELoss, we compare a query against only two passages.
[Source](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/README.md)
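A toy numerical sketch of the objective described above; all scores and embeddings are made-up values, purely for illustration:

```python
import torch

# made-up cross-encoder (teacher) scores for (query, passage1) and (query, passage2)
ce_score_p1, ce_score_p2 = 8.2, 3.1
ce_distance = ce_score_p1 - ce_score_p2  # teacher margin: 5.1

# made-up bi-encoder embeddings for query, passage1 and passage2
torch.manual_seed(0)
q, p1, p2 = torch.randn(3, 768)

# student margin from dot-product scores
be_distance = torch.dot(q, p1) - torch.dot(q, p2)

# the training objective: MSE between teacher margin and student margin
loss = (ce_distance - be_distance) ** 2
print(loss.item())
```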
📄 License
The model is released under the MIT license.