Cross-Encoder for multilingual MS Marco
This is a cross-encoder model trained on the multilingual MS Marco dataset, which can be used for information retrieval tasks.
Quick Start
This model was trained on the MMARCO dataset, a machine-translated version of MS MARCO covering 14 languages, produced with Google Translate. Experiments show that it also performs well on other languages. The multilingual MiniLMv2 model was used as the base model.
The model can be used for information retrieval: given a query, score it against all candidate passages (e.g., retrieved via Elasticsearch) and then sort the passages by score in descending order. For more details, refer to SBERT.net Retrieve & Re-rank. The training code is available at SBERT.net Training MS Marco.
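As a minimal sketch of this re-ranking step (assuming the candidate passages have already been retrieved, and using 'model_name' as a placeholder for the actual model identifier):

from sentence_transformers import CrossEncoder

# Query and candidate passages; in practice the passages would come from a
# first-stage retriever such as Elasticsearch (example texts are illustrative).
query = "How many people live in Berlin?"
passages = [
    "Berlin has a population of 3,520,031 registered inhabitants.",
    "New York City is famous for the Metropolitan Museum of Art.",
]

model = CrossEncoder('model_name')  # placeholder: replace with the actual model identifier

# Score every (query, passage) pair, then sort the passages by score, highest first
scores = model.predict([(query, passage) for passage in passages])
ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
for passage, score in ranked:
    print(f"{score:.4f}\t{passage}")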
Features
- Multilingual Support: Covers 14 languages: English, Arabic, Chinese, Dutch, French, German, Hindi, Indonesian, Italian, Japanese, Portuguese, Russian, Spanish, and Vietnamese, and can also be used in multilingual scenarios.
- High-quality Training Data: Trained on the MMARCO dataset, a large-scale multilingual passage-ranking dataset.
- Versatile Usage: Can be used for information retrieval tasks such as passage re-ranking.
Installation
The model itself requires no separate installation. To use it, install either the sentence-transformers or transformers library:
pip install sentence-transformers
pip install transformers
Usage Examples
Basic Usage with SentenceTransformers
With SentenceTransformers installed, you can use the pre-trained model as follows:
from sentence_transformers import CrossEncoder
model = CrossEncoder('model_name')
scores = model.predict([('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')])
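Newer releases of sentence-transformers also provide a rank() convenience method on CrossEncoder that pairs the query with each passage, scores the pairs, and returns the results sorted by score. A minimal sketch, assuming a sufficiently recent library version and the same 'model_name' placeholder:

from sentence_transformers import CrossEncoder

model = CrossEncoder('model_name')  # placeholder model identifier
query = "How many people live in Berlin?"
passages = [
    "Berlin has a population of 3,520,031 registered inhabitants.",
    "New York City is famous for the Metropolitan Museum of Art.",
]

# Returns a list of dicts sorted by descending score; with return_documents=True
# each entry also carries the passage text.
results = model.rank(query, passages, return_documents=True)
for hit in results:
    print(f"{hit['score']:.4f}\t{hit['text']}")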
Basic Usage with Transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model = AutoModelForSequenceClassification.from_pretrained('model_name')
tokenizer = AutoTokenizer.from_pretrained('model_name')
features = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'], padding=True, truncation=True, return_tensors="pt")
model.eval()
with torch.no_grad():
    scores = model(**features).logits
print(scores)
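Assuming a single-output classification head, as is typical for MS MARCO cross-encoders, each (query, passage) pair yields one relevance logit, and higher logits indicate higher relevance, so the logits can be sorted directly to rank the passages. A short continuation of the example above, repeating the passage texts for readability:

# Continuation of the example above: rank the passages by their relevance logits
passages = [
    'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
    'New York City is famous for the Metropolitan Museum of Art.',
]
order = torch.argsort(scores.squeeze(-1), descending=True).tolist()
for idx in order:
    print(f"{scores[idx].item():.4f}\t{passages[idx]}")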
License
This project is licensed under the Apache-2.0 license.
Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Cross-Encoder for multilingual MS Marco |
| Supported Languages | English, Arabic, Chinese, Dutch, French, German, Hindi, Indonesian, Italian, Japanese, Portuguese, Russian, Spanish, Vietnamese, Multilingual |
| Datasets | unicamp-dl/mmarco |
| Base Model | nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large |
| Pipeline Tag | text-ranking |
| Library Name | sentence-transformers |
| Tags | transformers |