This BERT-based model calculates query-passage relevance scores, significantly enhancing search engine result quality. It supports multilingual processing for global search applications.
Model Features
Multilingual Support: Supports over 100 languages, including major European and Asian languages
High Relevance Improvement: Can improve search result relevance by up to 100%
Elasticsearch Integration: Integrates directly with Elasticsearch without additional coding
Efficient Inference: Processing speed of ~300 ms per query, suitable for real-time applications
Model Capabilities
Multilingual text understanding
Query-passage relevance scoring
Search result reranking
Cross-language information retrieval
Use Cases
Search Engine Optimization
Enterprise Search Improvement: Enhances relevance for internal document search systems, with up to 100% relevance improvement
E-commerce Search: Improves product search accuracy on e-commerce platforms, helping users find relevant products more efficiently
Multilingual Applications
Global Content Retrieval: Provides unified search solutions for multilingual websites
This model supports over 100 languages and calculates how well a passage matches a search query, which can be used to improve Elasticsearch results.
Quick Start
How to use
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("amberoad/bert-multilingual-passage-reranking-msmarco")
model = AutoModelForSequenceClassification.from_pretrained("amberoad/bert-multilingual-passage-reranking-msmarco")
This model can be used as a drop-in replacement in the Nboost library, which lets you improve your Elasticsearch results directly without any additional coding.
Passage Reranking: Takes a search query and a passage and calculates how well the passage matches the query. It can be used to improve Elasticsearch results and boosts relevance by up to 100%.
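If you only want a quick relevance score for a single query-passage pair, a minimal sketch (not part of the original card) is to wrap the model in a transformers text-classification pipeline; passing the pair as a {"text": ..., "text_pair": ...} dict is supported in recent transformers versions, but check the behavior of your installed version:

from transformers import pipeline

# Load the reranker as a text-classification pipeline
reranker = pipeline(
    "text-classification",
    model="amberoad/bert-multilingual-passage-reranking-msmarco",
)

# Score a single query-passage pair, passed as a text/text_pair dict
result = reranker({"text": "What is a corporation?",
                   "text_pair": "A corporation is a legal entity owned by its shareholders."})
print(result)  # a dict with the predicted label and its score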
Installation
Installation is done with the transformers library in Python. You can install the necessary dependencies (the usage examples below also require PyTorch) with the following command:
pip install transformers torch
Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and reranking model
tokenizer = AutoTokenizer.from_pretrained("amberoad/bert-multilingual-passage-reranking-msmarco")
model = AutoModelForSequenceClassification.from_pretrained("amberoad/bert-multilingual-passage-reranking-msmarco")

# Example query-passage pair
query = "What is a corporation?"
passage = "A company is incorporated in a specific nation, often within the bounds of a smaller subset of that nation, such as a state or province. The corporation is then governed by the laws of incorporation in that state. A corporation may issue stock, either private or public, or may be classified as a non-stock corporation. If stock is issued, the corporation will usually be governed by its shareholders, either directly or indirectly."

# Encode the pair and compute the relevance logits
inputs = tokenizer(query, passage, return_tensors='pt')
outputs = model(**inputs)
print(outputs.logits)
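The printed logits are the raw scores of the model's two output classes rather than a single relevance value. A minimal follow-up sketch (continuing from the example above, and assuming logit index 1 corresponds to the "relevant" class, as is common for MS MARCO rerankers; verify this against the checkpoint's label mapping):

import torch

# Convert the two class logits into a relevance probability.
# Assumption: logit index 1 is the "relevant" class.
probs = torch.softmax(outputs.logits, dim=-1)
relevance = probs[0, 1].item()
print(f"Relevance: {relevance:.4f}")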
Purpose: This module takes a search query [1] and a passage [2] and calculates how well the passage matches the query. It can be used to improve Elasticsearch results and boosts relevance by up to 100%.
Architecture: On top of BERT there is a densely connected neural network which takes the 768-dimensional [CLS] token representation as input and provides the output (see the paper on arXiv).
Output: A single value between -10 and 10. Better matching query-passage pairs tend to have a higher score.
Intended uses & limitations
Both the query [1] and the passage [2] have to fit within 512 tokens together. Since you normally want to rerank the first few dozen search results, keep in mind the inference time of approximately 300 ms per query.
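One way to respect the 512-token limit while scoring a batch of candidate passages is sketched below (this is not from the original card; it assumes logit index 1 is the "relevant" class and truncates only the passage side of each pair):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "amberoad/bert-multilingual-passage-reranking-msmarco"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def rerank(query, passages):
    # Batch-encode all (query, passage) pairs; truncate the passage side so that
    # every pair fits within the 512-token limit.
    inputs = tokenizer([query] * len(passages), passages, padding=True,
                       truncation="only_second", max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumption: index 1 is the "relevant" class.
    scores = torch.softmax(logits, dim=-1)[:, 1]
    order = scores.argsort(descending=True).tolist()
    return [(passages[i], scores[i].item()) for i in order]

For example, rerank("What is a corporation?", candidate_passages) returns the candidates sorted from most to least relevant along with their scores.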
Training data
This model is trained using the Microsoft MS MARCO dataset. The training dataset contains approximately 400M tuples of a query together with relevant and non-relevant passages. All datasets used for training and evaluation are listed in this table. The dataset used for training is called Train Triples Large, while the evaluation was done on Top 1000 Dev. There are 6,900 queries in total in the development dataset, where each query is mapped to the top 1,000 passages retrieved from the MS MARCO corpus using BM25.
Training procedure
Training was performed the same way as stated in this README. See their excellent paper on arXiv. We changed the BERT model from the English-only version to the default BERT Multilingual Uncased model from Google. Training was done for 400,000 steps, which took about 12 hours on a TPU v3-8.
Eval results
We see nearly the same performance as the English-only model on the English Bing Queries dataset. Although the training data is English only, internal tests on private data showed a far higher accuracy in German than all other available models.
The eval results table is taken from nboost and extended by the first line.
Technical Details
The model is based on the BERT architecture, with a densely connected neural network on top that takes the 768-dimensional [CLS] token representation as input. The training process follows the method described in the README and the paper on arXiv. The BERT model was changed from English-only to the default BERT Multilingual Uncased model from Google.
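As a quick way to see this structure (a sketch, assuming the checkpoint loads as a standard BertForSequenceClassification), you can inspect the hidden size and the classification head:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "amberoad/bert-multilingual-passage-reranking-msmarco")

print(model.config.hidden_size)  # 768: dimensionality of the [CLS] representation
print(model.config.num_labels)   # number of output classes
print(model.classifier)          # the dense layer applied on top of the pooled [CLS] token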
License
This model is released under the Apache 2.0 license.
Contact Info
Amberoad is a company focusing on Search and Business Intelligence. We provide:
Advanced internal company search engines through NLP