🚀 polish-reranker-large-ranknet
This is a Polish text ranking model trained with RankNet loss on a large dataset of text pairs consisting of 1.4 million queries and 10 million documents. It offers high-efficiency text ranking, outperforming its teacher model on the Polish Information Retrieval Benchmark while using far fewer parameters and running substantially faster.
✨ Features
- Large-scale Training Data: The training data consists of three parts:
  - The Polish MS MARCO training split (800k queries).
  - The ELI5 dataset translated to Polish (over 500k queries).
  - A collection of Polish medical questions and answers (approximately 100k queries).
- Teacher-Student Training Strategy: Employs [unicamp-dl/mt5-13b-mmarco-100k](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k) as the teacher model and [Polish RoBERTa](https://huggingface.co/sdadas/polish-roberta-large-v2) as the student model.
- RankNet Loss: Unlike pointwise losses, RankNet computes the loss over a query and a pair of documents, taking into account the relative order of the documents sorted by their relevance to the query (see the sketch after this list).
- High Efficiency: The model outperforms its teacher on the Polish Information Retrieval Benchmark, despite having 30 times fewer parameters and being 33 times faster.
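The pairwise idea behind RankNet can be illustrated with a short PyTorch sketch. This is a minimal illustration of the loss, not the training code used for this model; the function name, the binary cross-entropy formulation on the score difference, and the toy inputs are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def ranknet_loss(score_pos: torch.Tensor, score_neg: torch.Tensor) -> torch.Tensor:
    """Pairwise RankNet loss: the more relevant document should score higher.

    score_pos / score_neg are cross-encoder scores for (query, document) pairs,
    with score_pos belonging to the document ranked higher for that query.
    """
    # P(pos ranked above neg) = sigmoid(score_pos - score_neg); the target is 1,
    # so the loss is -log sigmoid(diff) = softplus(-diff).
    return F.softplus(-(score_pos - score_neg)).mean()

# Toy example: three document pairs for some query.
pos = torch.tensor([2.1, 0.3, -0.5])
neg = torch.tensor([1.0, -0.2, 0.4])
print(ranknet_loss(pos, neg))  # smaller when positive scores exceed negative scores
```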
📦 Installation
No specific installation steps are provided in the original document. The examples below assume the sentence-transformers and transformers libraries are available (e.g., via `pip install sentence-transformers transformers`).
💻 Usage Examples
Basic Usage (Sentence-Transformers)
You can use the model with sentence-transformers like this:
```python
from sentence_transformers import CrossEncoder
import torch

# Query: "How to live to 100 years?", with one relevant and two irrelevant answers.
query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

# Identity keeps the raw relevance scores instead of applying a sigmoid.
model = CrossEncoder(
    "sdadas/polish-reranker-large-ranknet",
    default_activation_function=torch.nn.Identity(),
    max_length=512,
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# Score each (query, answer) pair; higher means more relevant.
pairs = [[query, answer] for answer in answers]
results = model.predict(pairs)
print(results.tolist())
```
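To turn the scores into a ranking, you can sort the answers by their predicted score. A minimal sketch, reusing the variables from the example above:

```python
# Rank answers from most to least relevant according to the model.
ranking = sorted(zip(answers, results), key=lambda pair: pair[1], reverse=True)
for answer, score in ranking:
    print(f"{score:.3f}\t{answer}")
```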
Advanced Usage (Huggingface Transformers)
The model can also be used with Huggingface Transformers in the following way:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model_name = "sdadas/polish-reranker-large-ranknet"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Concatenate query and answer with the RoBERTa separator tokens.
texts = [f"{query}</s></s>{answer}" for answer in answers]
tokens = tokenizer(texts, padding="longest", max_length=512, truncation=True, return_tensors="pt")

# The logits are the relevance scores; higher means more relevant.
output = model(**tokens)
results = output.logits.detach().numpy()
results = np.squeeze(results)
print(results.tolist())
```
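For inference only, it is also worth disabling gradient tracking. A minimal variant of the scoring step (assumes `torch` is imported alongside the snippet above):

```python
import torch

# Gradients are not needed when only scoring documents.
model.eval()
with torch.no_grad():
    scores = model(**tokens).logits.squeeze(-1)
print(scores.tolist())
```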
📚 Documentation
Evaluation Results
The model achieves an NDCG@10 of 62.65 in the Rerankers category of the Polish Information Retrieval Benchmark. See the PIRB Leaderboard for detailed results.
📄 License
The model is licensed under the Apache 2.0 license.
📚 Citation
```bibtex
@article{dadas2024assessing,
  title={Assessing generalization capability of text ranking models in Polish},
  author={Sławomir Dadas and Małgorzata Grębowiec},
  year={2024},
  eprint={2402.14318},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```