polish-reranker-base-mse: Open-source Polish Text Ranking Model

Developed by sdadas

This is a Polish text ranking model trained with the mean squared error (MSE) distillation method on a dataset of text pairs containing 1.4 million queries and 10 million documents.

Text Embedding

Transformers

License: Apache-2.0

Tags: Polish text ranking, MSE distillation training, information retrieval optimization

Release date: 2/3/2024

Model Overview

This model ranks Polish texts by relevance and is intended primarily for information retrieval tasks. It is trained via MSE distillation to reproduce the relevance scores of a large teacher model.

Model Features

MSE distillation training

Trained using mean squared error distillation, in which the student learns to replicate the scores of a large teacher model

Large-scale training data

Training dataset of text pairs includes 1.4 million queries and 10 million documents

Multi-domain coverage

Training data covers general search, Q&A, and medical domains

Model Capabilities

Text relevance ranking

Information retrieval

Q&A system support

Use Cases

Information retrieval

Search engine result ranking

Ranking search engine results by relevance

Improves the relevance of search results

Q&A systems

Answer ranking

Ranking multiple candidate answers generated by a Q&A system

Selects the most relevant answer

🚀 polish-reranker-base-mse

This is a Polish text ranking model. It's trained using the mean squared error (MSE) distillation method on a large dataset of text pairs, which includes 1.4 million queries and 10 million documents. The model can effectively rank Polish texts, providing valuable support for information retrieval tasks.

🚀 Quick Start

The training set of 1.4 million queries and 10 million documents combines three sources:

The Polish MS MARCO training split (800k queries).
The ELI5 dataset translated to Polish (over 500k queries).
A collection of Polish medical questions and answers (approximately 100k queries).

The teacher model is unicamp-dl/mt5-13b-mmarco-100k, a large multilingual reranker based on the MT5-XXL architecture. The student model is Polish RoBERTa. In the MSE method, the student is trained to directly replicate the relevance scores returned by the teacher.
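To illustrate the MSE objective described above, here is a minimal sketch in plain Python. The function name and the scores are hypothetical, chosen only for illustration; the actual training code is not part of this page and presumably operates on batched tensors rather than lists.

```python
from typing import Sequence


def mse_distillation_loss(
    student_scores: Sequence[float], teacher_scores: Sequence[float]
) -> float:
    """Mean squared error between student and teacher relevance scores."""
    if len(student_scores) != len(teacher_scores):
        raise ValueError("score lists must have the same length")
    if not student_scores:
        raise ValueError("at least one score pair is required")
    squared_errors = [(s - t) ** 2 for s, t in zip(student_scores, teacher_scores)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical scores for three (query, document) pairs
teacher = [4.0, -2.0, 0.5]
student = [3.0, -1.0, 0.5]
loss = mse_distillation_loss(student, teacher)
print(loss)
```

Minimizing this quantity pushes the student towards the teacher's absolute score for every pair, not just towards the same relative ordering.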

✨ Features

Large-scale Training: Trained on a large dataset of text pairs with 1.4 million queries and 10 million documents.
Diverse Training Data: The training data includes multiple sources such as the Polish MS MARCO training split, the ELI5 dataset translated to Polish, and a collection of Polish medical questions and answers.
Effective Distillation Method: Uses the mean squared error (MSE) distillation method, where the student model is trained to directly replicate the outputs of the teacher model.

📦 Installation

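The model card lists no dedicated installation steps. The usage examples below rely on the sentence-transformers, transformers, torch, and numpy Python packages, all of which are available from PyPI.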

💻 Usage Examples

Basic Usage (Sentence-Transformers)

You can use the model like this with sentence-transformers:

from sentence_transformers import CrossEncoder
import torch

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

# Identity returns the raw model outputs without applying an activation
# (newer sentence-transformers releases may call this argument activation_fn)
model = CrossEncoder(
    "sdadas/polish-reranker-base-mse",
    default_activation_function=torch.nn.Identity(),
    max_length=512,
    device="cuda" if torch.cuda.is_available() else "cpu"
)
# Score every (query, answer) pair; a higher score means a more relevant answer
pairs = [[query, answer] for answer in answers]
results = model.predict(pairs)
print(results.tolist())
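The example above passes torch.nn.Identity() as the activation, so the scores are raw, unbounded outputs. The sketch below, in plain Python with hypothetical score values, shows why that choice does not affect the ranking itself: a sigmoid (assumed here to be what the library would otherwise apply to single-score models) is strictly increasing, so it rescales scores into (0, 1) without changing their order.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical raw scores for three candidate answers
raw_scores = [7.2, -3.1, -9.8]
probabilities = [sigmoid(s) for s in raw_scores]

# Indices of the answers, best first, under each scoring scale
order_raw = sorted(range(len(raw_scores)), key=lambda i: raw_scores[i], reverse=True)
order_sig = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)

print(order_raw == order_sig)
```

Raw scores are the natural choice when downstream code needs the same scale the student was distilled on; a bounded scale can be more convenient for thresholding.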

Basic Usage (Hugging Face Transformers)

The model can also be used with Hugging Face Transformers in the following way:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np
import torch

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model_name = "sdadas/polish-reranker-base-mse"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# RoBERTa-style pair encoding: query and answer joined by two separator tokens
texts = [f"{query}</s></s>{answer}" for answer in answers]
tokens = tokenizer(texts, padding="longest", max_length=512, truncation=True, return_tensors="pt")
# Gradients are not needed for scoring, so disable them to save memory
with torch.no_grad():
    output = model(**tokens)
# One relevance logit per (query, answer) pair
results = output.logits.detach().numpy()
results = np.squeeze(results)
print(results.tolist())
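Both examples above print one score per answer but leave the ordering to the caller. A minimal sketch of that last step follows; the helper name rank_by_score and the score values are hypothetical and not part of either library.

```python
from typing import List, Sequence, Tuple


def rank_by_score(
    candidates: Sequence[str], scores: Sequence[float]
) -> List[Tuple[str, float]]:
    """Return (candidate, score) pairs sorted from most to least relevant."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical scores standing in for the model output
candidates = ["answer A", "answer B", "answer C"]
scores = [-1.5, 6.0, 2.25]

ranked = rank_by_score(candidates, scores)
best_answer, best_score = ranked[0]
print(ranked)
```

In a Q&A setting, ranked[0] is the answer to return; in a search setting, the whole list gives the reordered results.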

📚 Documentation

Evaluation Results

The model achieves an NDCG@10 of 57.50 in the Rerankers category of the Polish Information Retrieval Benchmark (PIRB). See PIRB Leaderboard for detailed results.
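For readers unfamiliar with the metric, NDCG@10 compares the discounted cumulative gain of the top 10 results, in the order the model ranked them, with that of an ideal ordering. The sketch below uses the linear-gain form of DCG; which variant the benchmark uses, and that the reported figure is the metric scaled by 100, are assumptions. The relevance labels are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(relevances, k) / ideal_dcg


# Hypothetical relevance labels, listed in the order the reranker returned them
perfect = ndcg_at_k([3, 2, 1, 0])
swapped = ndcg_at_k([0, 1])
print(perfect, swapped)
```

A benchmark score is typically the mean of this value over all test queries.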

📄 License

This model is licensed under the Apache-2.0 license.

📖 Citation

@article{dadas2024assessing,
  title={Assessing generalization capability of text ranking models in Polish}, 
  author={Sławomir Dadas and Małgorzata Grębowiec},
  year={2024},
  eprint={2402.14318},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
