🚀 MMLW-retrieval-e5-small
MMLW (muszę mieć lepszą wiadomość, Polish for "I must have a better message") are neural text encoders for Polish. This model is optimized for information retrieval tasks and transforms queries and passages into 384-dimensional vectors.
📚 Documentation
The model was developed using a two-step procedure:
- In the first step, it was initialized from a multilingual E5 checkpoint and then trained with a multilingual knowledge distillation method on a diverse corpus of 60 million Polish-English text pairs. We utilised English FlagEmbeddings (BGE) as teacher models for distillation.
- The second step involved fine-tuning the obtained models with a contrastive loss on the Polish MS MARCO training split. To improve the efficiency of contrastive training, we used large batch sizes: 1152 for small, 768 for base, and 288 for large models. Fine-tuning was conducted on a cluster of 12 A100 GPUs. A rough sketch of both objectives is shown below.
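For readers unfamiliar with these objectives, the sketch below illustrates what the two losses roughly look like. It is not the authors' training code: the MSE formulation of the distillation loss, the in-batch InfoNCE form of the contrastive loss, and the temperature value are assumptions made purely for illustration.

```python
# Illustrative sketch of the two training objectives (not the authors' code).
# Assumes embeddings are already L2-normalized torch tensors.
import torch
import torch.nn.functional as F

def distillation_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    # Multilingual knowledge distillation: pull the student's embedding of a Polish (or English)
    # text toward the teacher's embedding of the parallel English text. MSE is one common choice.
    return F.mse_loss(student_emb, teacher_emb)

def contrastive_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    # In-batch contrastive (InfoNCE) loss: each query should score highest against its own
    # positive passage among all passages in the batch (positives sit on the diagonal).
    scores = query_emb @ passage_emb.T / temperature
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

# Toy example with random unit vectors standing in for 384-dimensional encoder outputs.
batch, dim = 4, 384
q = F.normalize(torch.randn(batch, dim), dim=-1)
p = F.normalize(torch.randn(batch, dim), dim=-1)
print(distillation_loss(q, p).item(), contrastive_loss(q, p).item())
```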
⚠️ Important Note
2023-12-26: We have updated the model to a new version with improved results. You can still download the previous version using the v1 tag: `AutoModel.from_pretrained("sdadas/mmlw-retrieval-e5-small", revision="v1")`
💻 Usage Examples
Basic Usage
⚠️ Important Note
Our dense retrievers require the use of specific prefixes and suffixes when encoding texts. For this model, queries should be prefixed with "query: " and passages with "passage: ".
You can use the model with the sentence-transformers library as follows:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Required prefixes for this model (see the note above).
query_prefix = "query: "
answer_prefix = "passage: "

queries = [query_prefix + "Jak dożyć 100 lat?"]  # "How to live to 100?"
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model = SentenceTransformer("sdadas/mmlw-retrieval-e5-small")

# Encode the query and the passages, then pick the passage most similar to the query.
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
```
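Running the example prints the passage whose embedding has the highest cosine similarity to the query; for the question above, this should be the first answer about healthy eating and sport.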
📊 Evaluation Results
The model achieves an NDCG@10 of 52.34 on the Polish Information Retrieval Benchmark (PIRB). See the PIRB Leaderboard for detailed results.
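For context, NDCG@10 rewards rankings that place relevant passages near the top of the first ten retrieved results. A minimal sketch of the computation is shown below; it is not the PIRB evaluation code, and the toy relevance judgments are purely illustrative.

```python
# Illustrative computation of NDCG@10 (not the PIRB evaluation code).
import math

def ndcg_at_10(relevances):
    # relevances: graded relevance of the retrieved documents, in ranked order.
    dcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:10]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal[:10]))
    return dcg / idcg if idcg > 0 else 0.0

# One relevant document retrieved at rank 2 out of ten results.
print(ndcg_at_10([0, 1, 0, 0, 0, 0, 0, 0, 0, 0]))  # ≈ 0.63
```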
🙏 Acknowledgements
This model was trained with A100 GPU cluster support provided by the Gdansk University of Technology within the TASK center initiative.
📄 License
The model is released under the Apache 2.0 license.
📖 Citation
```bibtex
@article{dadas2024pirb,
  title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
  author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
  year={2024},
  eprint={2402.13350},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```