Cross-Encoder for multilingual MS Marco
This is a cross-encoder model trained on the multilingual MS Marco dataset, which can be used for information retrieval tasks.
Quick Start
This model was trained on the MMARCO dataset, a machine-translated version of MS MARCO covering 14 languages, produced with Google Translate. Experiments show that it also performs well on other languages. The multilingual MiniLMv2 model was used as the base model.
The model can be used for information retrieval: given a query, score it against all candidate passages (e.g., retrieved via Elasticsearch) and then sort the passages by score in descending order. For more details, refer to SBERT.net Retrieve & Re-rank. The training code is available at SBERT.net Training MS Marco.
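As a minimal sketch of this re-ranking step (assuming the candidate passages have already been retrieved, and using 'model_name' as a placeholder for the actual model identifier):

from sentence_transformers import CrossEncoder

# Query and candidate passages; in practice the passages would come from a
# first-stage retriever such as Elasticsearch (example texts are illustrative).
query = "How many people live in Berlin?"
passages = [
    "Berlin has a population of 3,520,031 registered inhabitants.",
    "New York City is famous for the Metropolitan Museum of Art.",
]

model = CrossEncoder('model_name')  # placeholder: replace with the actual model identifier

# Score every (query, passage) pair, then sort the passages by score, highest first
scores = model.predict([(query, passage) for passage in passages])
ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
for passage, score in ranked:
    print(f"{score:.4f}\t{passage}")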
Features
- Multilingual Support: Covers 14 languages: English, Arabic, Chinese, Dutch, French, German, Hindi, Indonesian, Italian, Japanese, Portuguese, Russian, Spanish, and Vietnamese, and can also be used in multilingual scenarios.
- High-quality Training Data: Trained on the MMARCO dataset, a large-scale multilingual passage-ranking dataset.
- Versatile Usage: Can be used for information retrieval tasks such as passage re-ranking.
Installation
The model itself requires no separate installation. To use it, install either the sentence-transformers or transformers library:
pip install sentence-transformers
pip install transformers
Usage Examples
Basic Usage with SentenceTransformers
With SentenceTransformers installed, you can use the pre-trained model as follows:
from sentence_transformers import CrossEncoder
model = CrossEncoder('model_name')
scores = model.predict([('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')])
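Newer releases of sentence-transformers also provide a rank() convenience method on CrossEncoder that pairs the query with each passage, scores the pairs, and returns the results sorted by score. A minimal sketch, assuming a sufficiently recent library version and the same 'model_name' placeholder:

from sentence_transformers import CrossEncoder

model = CrossEncoder('model_name')  # placeholder model identifier
query = "How many people live in Berlin?"
passages = [
    "Berlin has a population of 3,520,031 registered inhabitants.",
    "New York City is famous for the Metropolitan Museum of Art.",
]

# Returns a list of dicts sorted by descending score; with return_documents=True
# each entry also carries the passage text.
results = model.rank(query, passages, return_documents=True)
for hit in results:
    print(f"{hit['score']:.4f}\t{hit['text']}")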
Basic Usage with Transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model = AutoModelForSequenceClassification.from_pretrained('model_name')
tokenizer = AutoTokenizer.from_pretrained('model_name')
features = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'], padding=True, truncation=True, return_tensors="pt")
model.eval()
with torch.no_grad():
    scores = model(**features).logits
print(scores)
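Assuming a single-output classification head, as is typical for MS MARCO cross-encoders, each (query, passage) pair yields one relevance logit, and higher logits indicate higher relevance, so the logits can be sorted directly to rank the passages. A short continuation of the example above, repeating the passage texts for readability:

# Continuation of the example above: rank the passages by their relevance logits
passages = [
    'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
    'New York City is famous for the Metropolitan Museum of Art.',
]
order = torch.argsort(scores.squeeze(-1), descending=True).tolist()
for idx in order:
    print(f"{scores[idx].item():.4f}\t{passages[idx]}")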
License
This project is licensed under the Apache-2.0 license.
Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Cross-Encoder for multilingual MS Marco |
| Supported Languages | English, Arabic, Chinese, Dutch, French, German, Hindi, Indonesian, Italian, Japanese, Portuguese, Russian, Spanish, Vietnamese, Multilingual |
| Datasets | unicamp-dl/mmarco |
| Base Model | nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large |
| Pipeline Tag | text-ranking |
| Library Name | sentence-transformers |
| Tags | transformers |