polish-reranker-large-mse: Open-Source Polish Text Ranking Model for Accurate and Efficient Ranking of Large Numbers of Text Pairs

Polish Reranker Large MSE

Developed by sdadas

This is a Polish text ranking model trained with the mean squared error (MSE) distillation method on a dataset of text pairs comprising 1.4 million queries and 10 million documents.

Text Embedding

Transformers

Open Source License: Apache-2.0

Tags: Polish Text Ranking, MSE Distillation Training, Multi-domain QA Optimization

Release Time: 2/3/2024

Model Overview

This is a Polish text ranking model intended primarily for information retrieval. It scores how relevant each document is to a query, so that candidate documents can be ranked.

Model Features

MSE Distillation Training

Trained using the mean squared error (MSE) distillation method, in which the student model learns to directly replicate the teacher model's outputs.

Large-scale Training Data

The training dataset consists of text pairs built from 1.4 million queries and 10 million documents, covering multiple domains.

Multi-domain Adaptability

The training data includes the Polish MS MARCO training split, the ELI5 dataset translated into Polish, and a collection of Polish medical questions and answers, so the model has seen several different domains.

Model Capabilities

Text Ranking

Information Retrieval

Query-Document Relevance Scoring

Use Cases

Information Retrieval

Search Engine Result Ranking

Rank search engine results by their relevance to the query to improve user experience.

QA Systems

Rank candidate answers in QA systems to select the most relevant answer.

Medical Information Retrieval

Medical QA Ranking

Rank medical documents against user queries to help users find the most relevant medical information.

🚀 polish-reranker-large-mse

This is a Polish text ranking model for information retrieval tasks. It was trained with the mean squared error (MSE) distillation method on a large dataset of text pairs, covering 1.4 million queries and 10 million documents.

🚀 Quick Start

This is a Polish text ranking model trained using the mean squared error (MSE) distillation method on a large dataset of text pairs consisting of 1.4 million queries and 10 million documents. The training data included the following parts:

The Polish MS MARCO training split (800k queries);
The ELI5 dataset translated to Polish (over 500k queries);
A collection of Polish medical questions and answers (approximately 100k queries).

As a teacher model, we employed unicamp-dl/mt5-13b-mmarco-100k, a large multilingual reranker based on the MT5-XXL architecture. As a student model, we chose Polish RoBERTa. In the MSE method, the student is trained to directly replicate the outputs returned by the teacher.
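The MSE objective described above can be illustrated with a minimal sketch in plain Python. The function name `mse_distillation_loss` and the score values are hypothetical and chosen only for illustration; the actual training code, batching, and optimizer used for this model are not shown in this document.

```python
def mse_distillation_loss(student_scores, teacher_scores):
    """Mean squared error between student and teacher relevance scores.

    Each position holds the score for one query-document pair.
    """
    if len(student_scores) != len(teacher_scores):
        raise ValueError("score lists must have the same length")
    if not student_scores:
        raise ValueError("score lists must not be empty")
    squared_errors = [
        (s - t) ** 2 for s, t in zip(student_scores, teacher_scores)
    ]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical scores for three query-document pairs (illustrative only).
teacher_scores = [4.0, -2.0, 0.5]
student_scores = [3.0, -1.0, 0.5]

loss = mse_distillation_loss(student_scores, teacher_scores)
print(loss)
```

The loss reaches zero only when the student reproduces every teacher score exactly, which is what "directly replicate the outputs" means in practice. A real training loop would typically compute the same quantity with a deep learning framework's built-in MSE loss over mini-batches.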

✨ Features

Large-scale Training: Trained on a large dataset of text pairs with 1.4 million queries and 10 million documents.
Diverse Training Data: The training data comes from multiple sources, including Polish MS MARCO, the ELI5 dataset translated to Polish, and Polish medical Q&A.
MSE Distillation Method: Uses the mean squared error (MSE) distillation method, where the student model replicates the teacher model's outputs.

📦 Installation

Installation depends on the library you use. Both sentence-transformers and transformers can be installed via pip:

pip install sentence-transformers transformers

💻 Usage Examples

Basic Usage (Sentence-Transformers)

You can use the model like this with sentence-transformers:

from sentence_transformers import CrossEncoder
import torch

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

# Identity activation returns the raw scores instead of squashing them.
# Note: newer sentence-transformers releases may call this argument `activation_fn`.
model = CrossEncoder(
    "sdadas/polish-reranker-large-mse",
    default_activation_function=torch.nn.Identity(),
    max_length=512,
    device="cuda" if torch.cuda.is_available() else "cpu"
)
# Score each (query, answer) pair; a higher score means a more relevant answer.
pairs = [[query, answer] for answer in answers]
results = model.predict(pairs)
print(results.tolist())

Advanced Usage (Hugging Face Transformers)

The model can also be used with Hugging Face Transformers in the following way:

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model_name = "sdadas/polish-reranker-large-mse"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Join the query and each answer with the separator tokens the model expects.
texts = [f"{query}</s></s>{answer}" for answer in answers]
tokens = tokenizer(texts, padding="longest", max_length=512, truncation=True, return_tensors="pt")
# Inference only, so skip gradient tracking.
with torch.no_grad():
    output = model(**tokens)
# One logit per pair; squeeze the trailing dimension to get a flat list of scores.
results = output.logits.numpy()
results = np.squeeze(results)
print(results.tolist())
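Both snippets above return one relevance score per answer, in input order. To turn those scores into a ranking, the candidates can be sorted by score. The sketch below uses a hypothetical helper, `rank_candidates`, and made-up scores rather than real model output.

```python
def rank_candidates(candidates, scores, top_k=None):
    """Return (candidate, score) pairs sorted from most to least relevant."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Placeholder candidates and hypothetical scores (not produced by the model).
candidates = ["answer A", "answer B", "answer C"]
scores = [0.2, 1.5, -0.3]

ranked = rank_candidates(candidates, scores)
best = rank_candidates(candidates, scores, top_k=1)
print(ranked)
```

In a real pipeline, `scores` would be `results.tolist()` from either usage example, and `candidates` would be the `answers` list.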

📚 Documentation

Evaluation Results

The model achieves an NDCG@10 of 60.27 in the Rerankers category of the Polish Information Retrieval Benchmark (PIRB). See the PIRB Leaderboard for detailed results.
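For readers unfamiliar with the metric, here is a minimal sketch of how NDCG@10 can be computed for a single query. It assumes the common formulation with linear gain and a log2 rank discount, and it assumes the ranked list contains every judged document for the query; the exact variant used by PIRB is not stated here. The function names are illustrative.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so position 1 -> log2(2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances_in_ranked_order, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal_order = sorted(relevances_in_ranked_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal_order, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(relevances_in_ranked_order, k) / ideal_dcg


# Relevance labels of documents in the order the reranker returned them
# (hypothetical labels: 1 = relevant, 0 = not relevant).
perfect = ndcg_at_k([1, 0, 0])
swapped = ndcg_at_k([0, 1])
print(perfect, swapped)
```

Benchmark scores are usually the mean of this per-query value over all queries, often reported on a 0 to 100 scale.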

Citation

@article{dadas2024assessing,
  title={Assessing generalization capability of text ranking models in Polish}, 
  author={Sławomir Dadas and Małgorzata Grębowiec},
  year={2024},
  eprint={2402.14318},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

📄 License

This project is licensed under the Apache-2.0 license.
