🚀 MMLW-retrieval-roberta-large-v2
MMLW (muszę mieć lepszą wiadomość, "I must have a better message") are neural text encoders for Polish. The second version is based on the same foundational model as the first (polish-roberta-large-v2), but achieves improved results by incorporating modern LLM-based English retrievers and rerankers into the training process. This model is optimized for information retrieval tasks and transforms queries and passages into 1024-dimensional vectors.
✨ Features
- Optimized for information retrieval tasks.
- Transforms queries and passages into 1024-dimensional vectors (a quick check is shown after this list).
- Developed using a two-step training procedure for better performance.
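As a quick sanity check of the output dimensionality, the minimal sketch below encodes a single prefixed query and prints the vector size. It is an illustration only and omits the recommended GPU and Flash Attention settings, which are covered in the Usage Examples section.

```python
from sentence_transformers import SentenceTransformer

# Minimal dimensionality check (see Usage Examples for the full, recommended setup).
model = SentenceTransformer(
    "sdadas/mmlw-retrieval-roberta-large-v2",
    trust_remote_code=True,
    model_kwargs={"trust_remote_code": True},
)
vec = model.encode(["[query]: Jak dożyć 100 lat?"])
print(vec.shape[-1])  # expected: 1024
```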
📦 Installation
The model is used through the sentence-transformers library, which can be installed with `pip install sentence-transformers`. To enable Flash Attention 2 (recommended, see below), install the `flash-attn` package as well.
💻 Usage Examples
Basic Usage
The model supports both information retrieval and semantic textual similarity. For retrieval, queries should be prefixed with "[query]: ". For symmetric tasks such as semantic similarity, both texts should be prefixed with "[sts]: ".
Please note that the model uses a custom implementation, so you should pass the `trust_remote_code=True` argument when loading it. It is also recommended to enable Flash Attention 2 via the `attn_implementation` argument.
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer(
    "sdadas/mmlw-retrieval-roberta-large-v2",
    trust_remote_code=True,
    device="cuda",
    model_kwargs={"attn_implementation": "flash_attention_2", "trust_remote_code": True}
)
model.bfloat16()

# Information retrieval: prefix queries with "[query]: ", leave passages unprefixed
query_prefix = "[query]: "
queries = [query_prefix + "Jak dożyć 100 lat?"]  # "How can I live to be 100?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])

# Semantic textual similarity: prefix both sides with "[sts]: "
sim_prefix = "[sts]: "
sentences = [
    sim_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    sim_prefix + "Warto jest prowadzić zdrowy tryb życia, uwzględniający aktywność fizyczną i dietę.",
    sim_prefix + "One should eat healthy and engage in sports.",
    sim_prefix + "Zakupy potwierdzasz PINem, który bezpiecznie ustalisz podczas aktywacji."
]
emb = model.encode(sentences, convert_to_tensor=True, show_progress_bar=False)
print(cos_sim(emb, emb))
```
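For retrieval over a somewhat larger passage collection, the same API can be used to rank every passage for each query. The sketch below builds only on the calls shown above; the corpus, query, and variable names are invented for illustration, and the GPU and Flash Attention settings are omitted for brevity.

```python
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Illustrative sketch: rank a small, made-up corpus of passages for a query.
model = SentenceTransformer(
    "sdadas/mmlw-retrieval-roberta-large-v2",
    trust_remote_code=True,
    model_kwargs={"trust_remote_code": True},
)

corpus = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Regularne badania lekarskie pomagają wcześnie wykrywać choroby.",
    "Sklep jest otwarty od poniedziałku do piątku.",
]
queries = ["[query]: Jak dożyć 100 lat?"]  # queries get the "[query]: " prefix, passages do not

corpus_emb = model.encode(corpus, convert_to_tensor=True, show_progress_bar=False)
query_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)

scores = cos_sim(query_emb, corpus_emb)   # shape: (num_queries, num_passages)
top = torch.topk(scores, k=2, dim=-1)     # top-2 passages for each query
for rank, (score, idx) in enumerate(zip(top.values[0], top.indices[0]), start=1):
    print(f"{rank}. ({score:.3f}) {corpus[idx]}")
```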
🔧 Technical Details
The model was developed using a two-step procedure:
- In the first step, it was initialized with the Polish RoBERTa checkpoint and trained with the multilingual knowledge distillation method on a diverse corpus of 20 million Polish-English text pairs. stella_en_1.5B_v5 was used as the teacher model for distillation.
- The second step involved fine-tuning the model with a contrastive loss on a dataset of over 4 million queries. Positive and negative passages for each query were selected with the help of the BAAI/bge-reranker-v2.5-gemma2-lightweight reranker (a simplified sketch of both training objectives is shown after this list).
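To make the two objectives concrete, the sketch below shows, in simplified form, what a knowledge distillation loss (step one) and an in-batch contrastive loss (step two) typically look like. This is an illustration only, not the authors' training code: the tensor shapes, temperature value, and function names are assumptions, and the real setup additionally uses the reranker-mined hard negatives described above.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """Step 1 (sketch): pull the student's embeddings towards the teacher's
    embeddings for parallel Polish-English text pairs."""
    return F.mse_loss(student_emb, teacher_emb)

def contrastive_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """Step 2 (sketch): in-batch contrastive loss. Row i of `passage_emb` is the
    positive passage for query i; the remaining rows act as negatives.
    Mined hard negatives would be appended as extra candidate passages."""
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    scores = query_emb @ passage_emb.T / temperature   # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

# Toy shapes: a batch of 8 texts with 1024-dimensional embeddings (as produced by this model).
student, teacher = torch.randn(8, 1024), torch.randn(8, 1024)
queries, passages = torch.randn(8, 1024), torch.randn(8, 1024)
print(distillation_loss(student, teacher).item(), contrastive_loss(queries, passages).item())
```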
📚 Documentation
Evaluation Results
The model achieves NDCG@10 of 60.71 on the Polish Information Retrieval Benchmark (PIRB). See the PIRB Leaderboard for detailed results.
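NDCG@10 rewards rankings that place relevant passages near the top of the first ten results. The snippet below is a small, self-contained illustration of how the metric is computed for a single query; it is not the PIRB evaluation code, and the relevance labels are invented.

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query: DCG of the produced ranking divided by the DCG of the ideal ranking."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Invented relevance labels (1 = relevant, 0 = not), in the order the system ranked the passages.
print(round(ndcg_at_k([1, 0, 1, 0, 0, 0, 0, 0, 0, 0]), 4))
```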
📄 License
This model is released under the Gemma license.
📖 Citation
```bibtex
@inproceedings{dadas2024pirb,
  title={PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
  author={Dadas, Slawomir and Pere{\l}kiewicz, Micha{\l} and Po{\'s}wiata, Rafa{\l}},
  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  pages={12761--12774},
  year={2024}
}
```