Polish-reranker-base-ranknet Open-source Model - Empowering Ranking in Polish Text Information Retrieval Tasks

Polish Reranker Base Ranknet

Developed by sdadas

Polish text ranking model trained with RankNet loss function, suitable for information retrieval tasks

Text Embedding

Transformers

Open Source License: Apache-2.0 #Polish text ranking #RankNet optimization #Information retrieval

Downloads 332

Release Time : 2/3/2024

Model Overview

This is a Polish text ranking model trained using the RankNet loss function, primarily designed to improve query-document relevance ranking in information retrieval systems.

Model Features

RankNet training method

Uses the RankNet loss function, which is computed on the relative order of document pairs for a query rather than on each document independently

Large-scale training data

Training set contains 1.4 million queries and 10 million documents covering multiple domains

Knowledge distillation

Trained by knowledge distillation from a large MT5-XXL teacher model

Model Capabilities

Query-document relevance scoring

Search results re-ranking

Multi-document relevance comparison

Use Cases

Information retrieval systems

Search engine results optimization

Re-rank documents returned by search engines to improve ranking of relevant documents

QA systems

Select the most relevant answer from candidate responses

Medical field

Medical QA ranking

Rank relevance of answers in medical QA systems

🚀 polish-reranker-base-ranknet

This is a Polish text ranking model trained with the RankNet loss on a large dataset of text pairs comprising 1.4 million queries and 10 million documents. It is intended for reranking candidate documents in Polish information retrieval tasks.

✨ Features

Training Data Diversity: The training data consists of multiple parts, including the Polish MS MARCO training split (800k queries), the ELI5 dataset translated to Polish (over 500k queries), and a collection of Polish medical questions and answers (approximately 100k queries).
Teacher-Student Model Architecture: The teacher model is unicamp-dl/mt5-13b-mmarco-100k, a large multilingual reranker based on the MT5-XXL architecture. The student model is Polish RoBERTa.
RankNet Loss Method: Unlike pointwise losses, which treat each query-document pair independently, the RankNet method computes the loss from a query and pairs of documents, specifically from the relative order of documents sorted by their relevance to the query.
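
To make the pairwise idea concrete, here is a minimal sketch of a textbook RankNet loss for a single query, written in plain Python. It assumes that the teacher scores define which document of each pair should rank higher, and that every such pair contributes equally. The function name `ranknet_loss` and both of these choices are illustrative assumptions; the source does not describe the exact pair selection or weighting used to train this model.

```python
import math

def ranknet_loss(student_scores, teacher_scores):
    """Mean pairwise RankNet loss over one query's candidate documents.

    Hypothetical helper for illustration; not the authors' training code.
    """
    total, n_pairs = 0.0, 0
    for i in range(len(student_scores)):
        for j in range(len(student_scores)):
            # Use only pairs where the teacher ranks document i above document j.
            if teacher_scores[i] > teacher_scores[j]:
                diff = student_scores[i] - student_scores[j]
                # -log(sigmoid(diff)), computed in a numerically stable form.
                total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
                n_pairs += 1
    return total / n_pairs if n_pairs else 0.0

# The student agrees with the teacher's order in the first call, disagrees in the second.
agree = ranknet_loss([2.0, 1.0], [1.0, 0.0])
disagree = ranknet_loss([1.0, 2.0], [1.0, 0.0])
print(agree < disagree)
```

The loss depends only on score differences within a query, so the student is free to output scores on any scale as long as the ordering matches the teacher's.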

📦 Installation

Since this model can be used with existing libraries like sentence-transformers and Huggingface Transformers, you need to install these libraries first:

pip install sentence-transformers transformers

💻 Usage Examples

Basic Usage with Sentence-Transformers

You can use the model like this with sentence-transformers:

from sentence_transformers import CrossEncoder
import torch

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model = CrossEncoder(
    "sdadas/polish-reranker-base-ranknet",
    default_activation_function=torch.nn.Identity(),
    max_length=512,
    device="cuda" if torch.cuda.is_available() else "cpu"
)
pairs = [[query, answer] for answer in answers]
results = model.predict(pairs)
print(results.tolist())

Basic Usage with Huggingface Transformers

The model can also be used with Huggingface Transformers in the following way:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model_name = "sdadas/polish-reranker-base-ranknet"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
texts = [f"{query}</s></s>{answer}" for answer in answers]
tokens = tokenizer(texts, padding="longest", max_length=512, truncation=True, return_tensors="pt")
output = model(**tokens)
results = output.logits.detach().numpy()
results = np.squeeze(results)
print(results.tolist())
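
Both snippets print one raw score per answer, in the original input order. A reranking step then sorts the candidates by score. The sketch below shows that step in plain Python, assuming that a higher score means higher relevance. The helper `rank_by_score` is hypothetical, and the scores are placeholder values chosen for illustration, not real model outputs.

```python
def rank_by_score(candidates, scores):
    """Return (candidate, score) pairs sorted from most to least relevant."""
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)

# Placeholder values for illustration only; these are not real model outputs.
example_answers = ["answer A", "answer B", "answer C"]
example_scores = [0.3, 2.1, -1.4]

ranked = rank_by_score(example_answers, example_scores)
for position, (text, score) in enumerate(ranked, start=1):
    print(position, score, text)
```

In the examples above, `results.tolist()` can be passed as `scores` and `answers` as `candidates`.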

📚 Documentation

Evaluation Results

The model achieves NDCG@10 of 60.32 in the Rerankers category of the Polish Information Retrieval Benchmark. See PIRB Leaderboard for detailed results.
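
For readers unfamiliar with the metric, the following is a minimal sketch of NDCG@k for one query, using the common linear-gain formulation. It assumes that `relevances` holds the graded relevance labels of the returned documents in ranked order and that all relevant documents appear in that list. The benchmark's exact evaluation code may differ, and leaderboard values such as the one above are presumably averaged over queries and reported on a 0-100 scale.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# A ranking that places the only relevant document first is ideal.
print(ndcg_at_k([1, 0, 0]))
```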

📄 License

This project is licensed under the Apache-2.0 license.

📖 Citation

@article{dadas2024assessing,
  title={Assessing generalization capability of text ranking models in Polish}, 
  author={Sławomir Dadas and Małgorzata Grębowiec},
  year={2024},
  eprint={2402.14318},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
