🚀 MMLW-retrieval-roberta-large-v2
MMLW (muszę mieć lepszą wiadomość, "I must have a better message") are neural text encoders for Polish. The second version is based on the same foundational model as the first (polish-roberta-large-v2), but achieves improved results by incorporating modern LLM-based English retrievers and rerankers into the training process. This model is optimized for information retrieval tasks and transforms queries and passages into 1024-dimensional vectors.
✨ Features
- Optimized for information retrieval tasks.
- Transforms queries and passages into 1024-dimensional vectors (a quick check is shown after this list).
- Developed using a two-step training procedure for better performance.
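As a quick sanity check of the output dimensionality, the minimal sketch below encodes a single prefixed query and prints the vector size. It is an illustration only and omits the recommended GPU and Flash Attention settings, which are covered in the Usage Examples section.

```python
from sentence_transformers import SentenceTransformer

# Minimal dimensionality check (see Usage Examples for the full, recommended setup).
model = SentenceTransformer(
    "sdadas/mmlw-retrieval-roberta-large-v2",
    trust_remote_code=True,
    model_kwargs={"trust_remote_code": True},
)
vec = model.encode(["[query]: Jak dożyć 100 lat?"])
print(vec.shape[-1])  # expected: 1024
```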
📦 Installation
The model is used through the sentence-transformers library, which can be installed with `pip install sentence-transformers`. To enable Flash Attention 2 (recommended, see below), install the `flash-attn` package as well.
💻 Usage Examples
Basic Usage
The model supports both information retrieval and semantic textual similarity. For retrieval, queries should be prefixed with "[query]: ". For symmetric tasks such as semantic similarity, both texts should be prefixed with "[sts]: ".
Please note that the model uses a custom implementation, so you should pass the `trust_remote_code=True` argument when loading it. It is also recommended to enable Flash Attention 2 via the `attn_implementation` argument.
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer(
    "sdadas/mmlw-retrieval-roberta-large-v2",
    trust_remote_code=True,
    device="cuda",
    model_kwargs={"attn_implementation": "flash_attention_2", "trust_remote_code": True}
)
model.bfloat16()

# Information retrieval: prefix queries with "[query]: ", leave passages unprefixed
query_prefix = "[query]: "
queries = [query_prefix + "Jak dożyć 100 lat?"]  # "How can I live to be 100?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])

# Semantic textual similarity: prefix both sides with "[sts]: "
sim_prefix = "[sts]: "
sentences = [
    sim_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    sim_prefix + "Warto jest prowadzić zdrowy tryb życia, uwzględniający aktywność fizyczną i dietę.",
    sim_prefix + "One should eat healthy and engage in sports.",
    sim_prefix + "Zakupy potwierdzasz PINem, który bezpiecznie ustalisz podczas aktywacji."
]
emb = model.encode(sentences, convert_to_tensor=True, show_progress_bar=False)
print(cos_sim(emb, emb))
```
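For retrieval over a somewhat larger passage collection, the same API can be used to rank every passage for each query. The sketch below builds only on the calls shown above; the corpus, query, and variable names are invented for illustration, and the GPU and Flash Attention settings are omitted for brevity.

```python
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Illustrative sketch: rank a small, made-up corpus of passages for a query.
model = SentenceTransformer(
    "sdadas/mmlw-retrieval-roberta-large-v2",
    trust_remote_code=True,
    model_kwargs={"trust_remote_code": True},
)

corpus = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Regularne badania lekarskie pomagają wcześnie wykrywać choroby.",
    "Sklep jest otwarty od poniedziałku do piątku.",
]
queries = ["[query]: Jak dożyć 100 lat?"]  # queries get the "[query]: " prefix, passages do not

corpus_emb = model.encode(corpus, convert_to_tensor=True, show_progress_bar=False)
query_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)

scores = cos_sim(query_emb, corpus_emb)   # shape: (num_queries, num_passages)
top = torch.topk(scores, k=2, dim=-1)     # top-2 passages for each query
for rank, (score, idx) in enumerate(zip(top.values[0], top.indices[0]), start=1):
    print(f"{rank}. ({score:.3f}) {corpus[idx]}")
```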
🔧 Technical Details
The model was developed using a two-step procedure:
- In the first step, it was initialized with the Polish RoBERTa checkpoint and trained with the multilingual knowledge distillation method on a diverse corpus of 20 million Polish-English text pairs. stella_en_1.5B_v5 was used as the teacher model for distillation.
- The second step involved fine-tuning the model with a contrastive loss on a dataset of over 4 million queries. Positive and negative passages for each query were selected with the help of the BAAI/bge-reranker-v2.5-gemma2-lightweight reranker (a simplified sketch of both training objectives is shown after this list).
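To make the two objectives concrete, the sketch below shows, in simplified form, what a knowledge distillation loss (step one) and an in-batch contrastive loss (step two) typically look like. This is an illustration only, not the authors' training code: the tensor shapes, temperature value, and function names are assumptions, and the real setup additionally uses the reranker-mined hard negatives described above.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """Step 1 (sketch): pull the student's embeddings towards the teacher's
    embeddings for parallel Polish-English text pairs."""
    return F.mse_loss(student_emb, teacher_emb)

def contrastive_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """Step 2 (sketch): in-batch contrastive loss. Row i of `passage_emb` is the
    positive passage for query i; the remaining rows act as negatives.
    Mined hard negatives would be appended as extra candidate passages."""
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    scores = query_emb @ passage_emb.T / temperature   # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

# Toy shapes: a batch of 8 texts with 1024-dimensional embeddings (as produced by this model).
student, teacher = torch.randn(8, 1024), torch.randn(8, 1024)
queries, passages = torch.randn(8, 1024), torch.randn(8, 1024)
print(distillation_loss(student, teacher).item(), contrastive_loss(queries, passages).item())
```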
📚 Documentation
Evaluation Results
The model achieves NDCG@10 of 60.71 on the Polish Information Retrieval Benchmark (PIRB). See the PIRB Leaderboard for detailed results.
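NDCG@10 rewards rankings that place relevant passages near the top of the first ten results. The snippet below is a small, self-contained illustration of how the metric is computed for a single query; it is not the PIRB evaluation code, and the relevance labels are invented.

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query: DCG of the produced ranking divided by the DCG of the ideal ranking."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Invented relevance labels (1 = relevant, 0 = not), in the order the system ranked the passages.
print(round(ndcg_at_k([1, 0, 1, 0, 0, 0, 0, 0, 0, 0]), 4))
```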
📄 License
This model is released under the Gemma license.
📖 Citation
```bibtex
@inproceedings{dadas2024pirb,
  title={PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
  author={Dadas, Slawomir and Pere{\l}kiewicz, Micha{\l} and Po{\'s}wiata, Rafa{\l}},
  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  pages={12761--12774},
  year={2024}
}
```