🚀 MMLW-retrieval-roberta-large
MMLW (muszę mieć lepszą wiadomość) are neural text encoders for Polish. This model is optimized for information retrieval tasks.
🚀 Quick Start
MMLW (muszę mieć lepszą wiadomość) are neural text encoders designed for the Polish language. This model is optimized for information retrieval and transforms queries and passages into 1024-dimensional vectors.
The development of this model followed a two-step procedure:
- First, it was initialized with a Polish RoBERTa checkpoint and trained using the multilingual knowledge distillation method on a diverse corpus of 60 million Polish-English text pairs, with English FlagEmbeddings (BGE) serving as teacher models (a minimal sketch of this objective follows after this list).
- In the second step, the obtained models were fine-tuned with contrastive loss on the Polish MS MARCO training split. To make contrastive training effective, large batch sizes were used: 1152 for small, 768 for base, and 288 for large models. Fine-tuning was carried out on a cluster of 12 A100 GPUs.
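To make the first step concrete, here is a minimal sketch of a multilingual knowledge distillation objective, not the authors' training code: the student encodes both sides of a Polish-English pair, and both embeddings are regressed onto the frozen teacher's embedding of the English text. All tensors below are placeholders.

```python
import torch
from torch import nn

mse = nn.MSELoss()

def distillation_loss(student_emb_en: torch.Tensor,
                      student_emb_pl: torch.Tensor,
                      teacher_emb_en: torch.Tensor) -> torch.Tensor:
    # Both the English and the Polish student embeddings are pulled toward the
    # frozen English teacher embedding (multilingual knowledge distillation).
    return mse(student_emb_en, teacher_emb_en) + mse(student_emb_pl, teacher_emb_en)

# Toy usage with random vectors of the model's output dimensionality (1024).
loss = distillation_loss(torch.randn(8, 1024), torch.randn(8, 1024), torch.randn(8, 1024))
```

In practice, the sentence-transformers library provides a ready-made losses.MSELoss that implements this kind of teacher-student objective.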
⚠️ Important Note
2023-12-26: We have updated the model to a new version with improved results. You can still download the previous version using the v1 tag: `AutoModel.from_pretrained("sdadas/mmlw-retrieval-roberta-large", revision="v1")`
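If you work with sentence-transformers rather than transformers directly, the same tag can be loaded as shown in the illustrative snippet below; it assumes a sentence-transformers release that accepts the revision argument and is not part of the original instructions.

```python
from sentence_transformers import SentenceTransformer

# Assumes a sentence-transformers version that supports the `revision` argument.
model_v1 = SentenceTransformer("sdadas/mmlw-retrieval-roberta-large", revision="v1")
```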
✨ Features
- Optimized for information retrieval tasks in Polish.
- Can transform text into 1024-dimensional vectors.
- Developed through a two-step procedure involving knowledge distillation and fine-tuning.
💻 Usage Examples
Basic Usage
⚠️ Important Note
Our dense retrievers require the use of specific prefixes and suffixes when encoding texts. For this model, each query should be preceded by the prefix "zapytanie: ", while passages are encoded without a prefix (note the empty answer_prefix in the example below).
You can use the model with sentence-transformers as follows:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Queries must be prefixed with "zapytanie: "; passages are encoded as-is.
query_prefix = "zapytanie: "
answer_prefix = ""
queries = [query_prefix + "Jak dożyć 100 lat?"]  # "How to live to 100?"
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
model = SentenceTransformer("sdadas/mmlw-retrieval-roberta-large")

# Encode queries and passages into 1024-dimensional vectors.
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

# Rank passages by cosine similarity and print the best match.
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
```
📚 Documentation
Evaluation Results
The model achieves NDCG@10 of 58.46 on the Polish Information Retrieval Benchmark (PIRB). See the PIRB Leaderboard for detailed results.
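For readers unfamiliar with the metric, the snippet below is an illustrative NDCG@10 computation for a single query using the standard log2 discount; it is not the PIRB evaluation code.

```python
import math

def ndcg_at_10(relevances_in_rank_order, k=10):
    # relevances_in_rank_order: graded relevance of retrieved documents, in rank order.
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances_in_rank_order, reverse=True))
    return dcg(relevances_in_rank_order) / ideal if ideal > 0 else 0.0

# Example: with binary judgments, a single relevant document retrieved at rank 2
# gives NDCG@10 of about 0.63.
print(ndcg_at_10([0, 1, 0, 0, 0]))
```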
🔧 Technical Details
The model development involved two main steps:
- Initialization with a Polish RoBERTa checkpoint, followed by training on a large Polish-English text corpus using the multilingual knowledge distillation method, with English FlagEmbeddings (BGE) as teacher models.
- Fine-tuning on the Polish MS MARCO training split with contrastive loss, using large batch sizes on a cluster of 12 A100 GPUs (a sketch of this loss follows below).
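The following is a minimal sketch of a standard in-batch-negative contrastive (InfoNCE) objective. The exact loss formulation, temperature, and negative-sampling strategy used by the authors are not given here, so treat those details as assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor,
                     passage_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    # Cosine-similarity matrix between all queries and all passages in the batch;
    # passage i is the positive for query i, the rest serve as in-batch negatives.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    scores = q @ p.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(scores, labels)

# Toy usage: larger batches (e.g. 288 for the large model) provide more in-batch
# negatives per query, which is why large batch sizes help contrastive training.
loss = contrastive_loss(torch.randn(288, 1024), torch.randn(288, 1024))
```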
📄 License
This model is licensed under the Apache 2.0 license.
Acknowledgements
This model was trained with the support of the A100 GPU cluster provided by the Gdansk University of Technology within the TASK center initiative.
Citation
```bibtex
@article{dadas2024pirb,
  title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
  author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
  year={2024},
  eprint={2402.13350},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```