🚀 MMLW-retrieval-roberta-large
MMLW (muszę mieć lepszą wiadomość) are neural text encoders for Polish. This model is optimized for information retrieval tasks.
🚀 Quick Start
MMLW (muszę mieć lepszą wiadomość) are neural text encoders designed for the Polish language. This model is optimized for information retrieval and transforms queries and passages into 1024-dimensional vectors.
The development of this model followed a two-step procedure:
- First, it was initialized with a Polish RoBERTa checkpoint and trained using the multilingual knowledge distillation method on a diverse corpus of 60 million Polish-English text pairs, with English FlagEmbeddings (BGE) serving as teacher models (a minimal sketch of this objective follows after this list).
- In the second step, the obtained models were fine-tuned with contrastive loss on the Polish MS MARCO training split. To make contrastive training effective, large batch sizes were used: 1152 for small, 768 for base, and 288 for large models. Fine-tuning was carried out on a cluster of 12 A100 GPUs.
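To make the first step concrete, here is a minimal sketch of a multilingual knowledge distillation objective, not the authors' training code: the student encodes both sides of a Polish-English pair, and both embeddings are regressed onto the frozen teacher's embedding of the English text. All tensors below are placeholders.

```python
import torch
from torch import nn

mse = nn.MSELoss()

def distillation_loss(student_emb_en: torch.Tensor,
                      student_emb_pl: torch.Tensor,
                      teacher_emb_en: torch.Tensor) -> torch.Tensor:
    # Both the English and the Polish student embeddings are pulled toward the
    # frozen English teacher embedding (multilingual knowledge distillation).
    return mse(student_emb_en, teacher_emb_en) + mse(student_emb_pl, teacher_emb_en)

# Toy usage with random vectors of the model's output dimensionality (1024).
loss = distillation_loss(torch.randn(8, 1024), torch.randn(8, 1024), torch.randn(8, 1024))
```

In practice, the sentence-transformers library provides a ready-made losses.MSELoss that implements this kind of teacher-student objective.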
⚠️ Important Note
2023-12-26: We have updated the model to a new version with improved results. You can still download the previous version using the v1 tag: `AutoModel.from_pretrained("sdadas/mmlw-retrieval-roberta-large", revision="v1")`
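If you work with sentence-transformers rather than transformers directly, the same tag can be loaded as shown in the illustrative snippet below; it assumes a sentence-transformers release that accepts the revision argument and is not part of the original instructions.

```python
from sentence_transformers import SentenceTransformer

# Assumes a sentence-transformers version that supports the `revision` argument.
model_v1 = SentenceTransformer("sdadas/mmlw-retrieval-roberta-large", revision="v1")
```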
✨ Features
- Optimized for information retrieval tasks in Polish.
- Can transform text into 1024-dimensional vectors.
- Developed through a two-step procedure involving knowledge distillation and fine-tuning.
💻 Usage Examples
Basic Usage
⚠️ Important Note
Our dense retrievers require the use of specific prefixes and suffixes when encoding texts. For this model, each query should be preceded by the prefix "zapytanie: ", while passages are encoded without a prefix (note the empty answer_prefix in the example below).
You can use the model with sentence-transformers as follows:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Queries must be prefixed with "zapytanie: "; passages are encoded as-is.
query_prefix = "zapytanie: "
answer_prefix = ""
queries = [query_prefix + "Jak dożyć 100 lat?"]  # "How to live to 100?"
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
model = SentenceTransformer("sdadas/mmlw-retrieval-roberta-large")

# Encode queries and passages into 1024-dimensional vectors.
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

# Rank passages by cosine similarity and print the best match.
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
```
📚 Documentation
Evaluation Results
The model achieves NDCG@10 of 58.46 on the Polish Information Retrieval Benchmark (PIRB). See the PIRB Leaderboard for detailed results.
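For readers unfamiliar with the metric, the snippet below is an illustrative NDCG@10 computation for a single query using the standard log2 discount; it is not the PIRB evaluation code.

```python
import math

def ndcg_at_10(relevances_in_rank_order, k=10):
    # relevances_in_rank_order: graded relevance of retrieved documents, in rank order.
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances_in_rank_order, reverse=True))
    return dcg(relevances_in_rank_order) / ideal if ideal > 0 else 0.0

# Example: with binary judgments, a single relevant document retrieved at rank 2
# gives NDCG@10 of about 0.63.
print(ndcg_at_10([0, 1, 0, 0, 0]))
```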
🔧 Technical Details
The model development involved two main steps:
- Initialization with a Polish RoBERTa checkpoint, followed by training on a large Polish-English text corpus using the multilingual knowledge distillation method, with English FlagEmbeddings (BGE) as teacher models.
- Fine-tuning on the Polish MS MARCO training split with contrastive loss, using large batch sizes on a cluster of 12 A100 GPUs (a sketch of this loss follows below).
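The following is a minimal sketch of a standard in-batch-negative contrastive (InfoNCE) objective. The exact loss formulation, temperature, and negative-sampling strategy used by the authors are not given here, so treat those details as assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor,
                     passage_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    # Cosine-similarity matrix between all queries and all passages in the batch;
    # passage i is the positive for query i, the rest serve as in-batch negatives.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    scores = q @ p.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(scores, labels)

# Toy usage: larger batches (e.g. 288 for the large model) provide more in-batch
# negatives per query, which is why large batch sizes help contrastive training.
loss = contrastive_loss(torch.randn(288, 1024), torch.randn(288, 1024))
```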
📄 License
This model is licensed under the Apache 2.0 license.
Acknowledgements
This model was trained with the support of the A100 GPU cluster provided by the Gdansk University of Technology within the TASK center initiative.
Citation
```bibtex
@article{dadas2024pirb,
  title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
  author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
  year={2024},
  eprint={2402.13350},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```