🚀 MMLW-retrieval-e5-small
MMLW (muszę mieć lepszą wiadomość, Polish for "I must have a better message") are neural text encoders for Polish. This model is optimized for information retrieval tasks and transforms queries and passages into 384-dimensional vectors.
📚 Documentation
The model was developed using a two-step procedure:
- In the first step, it was initialized from a multilingual E5 checkpoint and then trained with a multilingual knowledge distillation method on a diverse corpus of 60 million Polish-English text pairs. We utilised English FlagEmbeddings (BGE) as teacher models for distillation.
- The second step involved fine-tuning the obtained models with a contrastive loss on the Polish MS MARCO training split. To improve the efficiency of contrastive training, we used large batch sizes: 1152 for small, 768 for base, and 288 for large models. Fine-tuning was conducted on a cluster of 12 A100 GPUs. A rough sketch of both objectives is shown below.
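For readers unfamiliar with these objectives, the sketch below illustrates what the two losses roughly look like. It is not the authors' training code: the MSE formulation of the distillation loss, the in-batch InfoNCE form of the contrastive loss, and the temperature value are assumptions made purely for illustration.

```python
# Illustrative sketch of the two training objectives (not the authors' code).
# Assumes embeddings are already L2-normalized torch tensors.
import torch
import torch.nn.functional as F

def distillation_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    # Multilingual knowledge distillation: pull the student's embedding of a Polish (or English)
    # text toward the teacher's embedding of the parallel English text. MSE is one common choice.
    return F.mse_loss(student_emb, teacher_emb)

def contrastive_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    # In-batch contrastive (InfoNCE) loss: each query should score highest against its own
    # positive passage among all passages in the batch (positives sit on the diagonal).
    scores = query_emb @ passage_emb.T / temperature
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

# Toy example with random unit vectors standing in for 384-dimensional encoder outputs.
batch, dim = 4, 384
q = F.normalize(torch.randn(batch, dim), dim=-1)
p = F.normalize(torch.randn(batch, dim), dim=-1)
print(distillation_loss(q, p).item(), contrastive_loss(q, p).item())
```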
⚠️ Important Note
2023-12-26: We have updated the model to a new version with improved results. You can still download the previous version using the v1 tag: `AutoModel.from_pretrained("sdadas/mmlw-retrieval-e5-small", revision="v1")`
💻 Usage Examples
Basic Usage
⚠️ Important Note
Our dense retrievers require the use of specific prefixes and suffixes when encoding texts. For this model, queries should be prefixed with "query: " and passages with "passage: ".
You can use the model with the sentence-transformers library as follows:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Required prefixes for this model (see the note above).
query_prefix = "query: "
answer_prefix = "passage: "

queries = [query_prefix + "Jak dożyć 100 lat?"]  # "How to live to 100?"
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model = SentenceTransformer("sdadas/mmlw-retrieval-e5-small")

# Encode the query and the passages, then pick the passage most similar to the query.
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
```
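Running the example prints the passage whose embedding has the highest cosine similarity to the query; for the question above, this should be the first answer about healthy eating and sport.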
📊 Evaluation Results
The model achieves an NDCG@10 of 52.34 on the Polish Information Retrieval Benchmark (PIRB). See the PIRB Leaderboard for detailed results.
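For context, NDCG@10 rewards rankings that place relevant passages near the top of the first ten retrieved results. A minimal sketch of the computation is shown below; it is not the PIRB evaluation code, and the toy relevance judgments are purely illustrative.

```python
# Illustrative computation of NDCG@10 (not the PIRB evaluation code).
import math

def ndcg_at_10(relevances):
    # relevances: graded relevance of the retrieved documents, in ranked order.
    dcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:10]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal[:10]))
    return dcg / idcg if idcg > 0 else 0.0

# One relevant document retrieved at rank 2 out of ten results.
print(ndcg_at_10([0, 1, 0, 0, 0, 0, 0, 0, 0, 0]))  # ≈ 0.63
```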
🙏 Acknowledgements
This model was trained with A100 GPU cluster support provided by the Gdansk University of Technology within the TASK center initiative.
📄 License
The model is released under the Apache 2.0 license.
📖 Citation
```bibtex
@article{dadas2024pirb,
  title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
  author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
  year={2024},
  eprint={2402.13350},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```