🚀 polish-reranker-large-ranknet
This is a Polish text ranking model trained with RankNet loss on a large dataset of text pairs consisting of 1.4 million queries and 10 million documents. It offers high-efficiency text ranking, outperforming its teacher model on the Polish Information Retrieval Benchmark while using far fewer parameters and running substantially faster.
✨ Features
- Large-scale Training Data: The training data consists of three parts:
  - The Polish MS MARCO training split (800k queries).
  - The ELI5 dataset translated to Polish (over 500k queries).
  - A collection of Polish medical questions and answers (approximately 100k queries).
- Teacher-Student Training Strategy: Employs [unicamp-dl/mt5-13b-mmarco-100k](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k) as the teacher model and [Polish RoBERTa](https://huggingface.co/sdadas/polish-roberta-large-v2) as the student model.
- RankNet Loss: Unlike pointwise losses, RankNet computes the loss over a query and a pair of documents, taking into account the relative order of the documents sorted by their relevance to the query (see the sketch after this list).
- High Efficiency: The model outperforms its teacher on the Polish Information Retrieval Benchmark, despite having 30 times fewer parameters and being 33 times faster.
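The pairwise idea behind RankNet can be illustrated with a short PyTorch sketch. This is a minimal illustration of the loss, not the training code used for this model; the function name, the binary cross-entropy formulation on the score difference, and the toy inputs are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def ranknet_loss(score_pos: torch.Tensor, score_neg: torch.Tensor) -> torch.Tensor:
    """Pairwise RankNet loss: the more relevant document should score higher.

    score_pos / score_neg are cross-encoder scores for (query, document) pairs,
    with score_pos belonging to the document ranked higher for that query.
    """
    # P(pos ranked above neg) = sigmoid(score_pos - score_neg); the target is 1,
    # so the loss is -log sigmoid(diff) = softplus(-diff).
    return F.softplus(-(score_pos - score_neg)).mean()

# Toy example: three document pairs for some query.
pos = torch.tensor([2.1, 0.3, -0.5])
neg = torch.tensor([1.0, -0.2, 0.4])
print(ranknet_loss(pos, neg))  # smaller when positive scores exceed negative scores
```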
📦 Installation
No specific installation steps are provided in the original document. The examples below assume the sentence-transformers and transformers libraries are available (e.g., via `pip install sentence-transformers transformers`).
💻 Usage Examples
Basic Usage (Sentence-Transformers)
You can use the model with sentence-transformers like this:
```python
from sentence_transformers import CrossEncoder
import torch

# Query: "How to live to 100 years?", with one relevant and two irrelevant answers.
query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

# Identity keeps the raw relevance scores instead of applying a sigmoid.
model = CrossEncoder(
    "sdadas/polish-reranker-large-ranknet",
    default_activation_function=torch.nn.Identity(),
    max_length=512,
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# Score each (query, answer) pair; higher means more relevant.
pairs = [[query, answer] for answer in answers]
results = model.predict(pairs)
print(results.tolist())
```
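To turn the scores into a ranking, you can sort the answers by their predicted score. A minimal sketch, reusing the variables from the example above:

```python
# Rank answers from most to least relevant according to the model.
ranking = sorted(zip(answers, results), key=lambda pair: pair[1], reverse=True)
for answer, score in ranking:
    print(f"{score:.3f}\t{answer}")
```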
Advanced Usage (Huggingface Transformers)
The model can also be used with Huggingface Transformers in the following way:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model_name = "sdadas/polish-reranker-large-ranknet"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Concatenate query and answer with the RoBERTa separator tokens.
texts = [f"{query}</s></s>{answer}" for answer in answers]
tokens = tokenizer(texts, padding="longest", max_length=512, truncation=True, return_tensors="pt")

# The logits are the relevance scores; higher means more relevant.
output = model(**tokens)
results = output.logits.detach().numpy()
results = np.squeeze(results)
print(results.tolist())
```
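For inference only, it is also worth disabling gradient tracking. A minimal variant of the scoring step (assumes `torch` is imported alongside the snippet above):

```python
import torch

# Gradients are not needed when only scoring documents.
model.eval()
with torch.no_grad():
    scores = model(**tokens).logits.squeeze(-1)
print(scores.tolist())
```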
📚 Documentation
Evaluation Results
The model achieves an NDCG@10 of 62.65 in the Rerankers category of the Polish Information Retrieval Benchmark. See the PIRB Leaderboard for detailed results.
📄 License
The model is licensed under the Apache 2.0 license.
📚 Citation
```bibtex
@article{dadas2024assessing,
  title={Assessing generalization capability of text ranking models in Polish},
  author={Sławomir Dadas and Małgorzata Grębowiec},
  year={2024},
  eprint={2402.14318},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```