Silver Retriever: Open-Source Polish Neural Retrieval Model - Free Deployment for Sentence and Paragraph Retrieval

Silver Retriever Base V1

Developed by ipipan

Silver Retriever is a neural retrieval model designed specifically for the Polish language, focusing on sentence similarity and paragraph retrieval tasks.

Text Embedding

Transformers

Tags: Polish semantic retrieval, high-precision paragraph matching, Q&A system optimization


Release date: 8/16/2023

Model Overview

This model encodes Polish sentences or paragraphs into a 768-dimensional dense vector space, suitable for tasks like document retrieval or semantic search. It is initialized from HerBERT-base and fine-tuned on the PolQA and MAUPQA datasets.

Model Features

Efficient Paragraph Retrieval

Paragraph retrieval optimized for Polish, with an average Accuracy@10 of 92.45% across the PolQA, Allegro FAQ, and Legal Questions datasets

768-Dimensional Dense Vectors

Encodes sentences or paragraphs into 768-dimensional dense vectors, ideal for semantic search tasks

Multi-Dataset Training

Fine-tuned on the PolQA and MAUPQA datasets for 15,000 steps with a batch size of 1,024

Model Capabilities

Sentence similarity calculation

Paragraph retrieval

Semantic search

Q&A system support

Use Cases

Information Retrieval

Polish Q&A System

Used as the retrieval component for building Polish Q&A systems

Achieved 87.24% accuracy at 10 (Accuracy@10) on the PolQA dataset

Document Retrieval

Helps users quickly find relevant document paragraphs

Achieved 94.56% accuracy at 10 (Accuracy@10) on the Allegro FAQ dataset

🚀 Silver Retriever Base (v1)

The Silver Retriever model encodes Polish sentences or paragraphs into a 768-dimensional dense vector space. It can be used for tasks like document retrieval or semantic search in Polish.


🚀 Quick Start

The Silver Retriever model is initialized from the HerBERT-base model and fine-tuned on the PolQA and MAUPQA datasets for 15,000 steps with a batch size of 1,024. For more details, refer to the paper "Silver Retriever: Advancing Neural Passage Retrieval for Polish Question Answering".

✨ Features

Encoding Capability: Encodes Polish sentences or paragraphs into a 768-dimensional dense vector space.
Task Suitability: Suited to document retrieval and semantic search tasks.
Fine-Tuned: Fine-tuned on the PolQA and MAUPQA datasets.

📦 Installation

To use this model with sentence-transformers, install the library with the following command:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

Using Sentence-Transformers

from sentence_transformers import SentenceTransformer
sentences = [
    "Pytanie: W jakim mieście urodził się Zbigniew Herbert?", 
    "Zbigniew Herbert</s>Zbigniew Bolesław Ryszard Herbert (ur. 29 października 1924 we Lwowie, zm. 28 lipca 1998 w Warszawie) – polski poeta, eseista i dramaturg."
]

model = SentenceTransformer('ipipan/silver-retriever-base-v1')
embeddings = model.encode(sentences)
print(embeddings)

Using HuggingFace Transformers

from transformers import AutoTokenizer, AutoModel
import torch


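def cls_pooling(model_output, attention_mask):
    # model_output[0] holds the token embeddings; take the first token (the CLS position) of each sequence.
    # attention_mask is not needed for CLS pooling.
    return model_output[0][:, 0]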


# Sentences we want sentence embeddings for
sentences = [
    "Pytanie: W jakim mieście urodził się Zbigniew Herbert?", 
    "Zbigniew Herbert</s>Zbigniew Bolesław Ryszard Herbert (ur. 29 października 1924 we Lwowie, zm. 28 lipca 1998 w Warszawie) – polski poeta, eseista i dramaturg."
]
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('ipipan/silver-retriever-base-v1')
model = AutoModel.from_pretrained('ipipan/silver-retriever-base-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, cls pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
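
The embeddings from either snippet are usually compared with a similarity score to rank passages for a question. The sketch below is only an illustration: it uses short toy vectors in place of real 768-dimensional embeddings, and the helper names (`dot`, `cosine_similarity`, `rank_passages`) are hypothetical, not part of the model or of any library. It shows why the choice between the dot product and cosine similarity can change the ranking.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the two vectors after length normalization."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm > 0 else 0.0


def rank_passages(question_vec: Sequence[float],
                  passage_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return passage indices sorted from most to least similar (cosine)."""
    scores = [cosine_similarity(question_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy vectors (assumed data, not real model output).
question = [1.0, 0.0]
passages = [[10.0, 10.0], [2.0, 0.0], [0.0, 3.0]]

# The dot product favors the long vector at index 0 (scores 10, 2, 0),
# while cosine similarity favors the aligned vector at index 1.
print(rank_passages(question, passages))  # [1, 0, 2]
```

With real embeddings, the same functions can be applied after converting each row to a list, for example with `embeddings.tolist()`.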

📚 Documentation

Evaluation

Property	Details
Model Type	Silver Retriever Base (v1)
Training Data	PolQA, MAUPQA

Model	Average [Acc]	Average [NDCG]	PolQA [Acc]	PolQA [NDCG]	Allegro FAQ [Acc]	Allegro FAQ [NDCG]	Legal Questions [Acc]	Legal Questions [NDCG]
BM25	74.87	51.81	61.35	24.51	66.89	48.71	96.38	82.21
BM25 (lemma)	80.46	55.44	71.49	31.97	75.33	55.70	94.57	78.65
MiniLM-L12-v2	62.62	39.21	37.24	11.93	71.67	51.25	78.97	54.44
LaBSE	64.89	39.47	46.23	15.53	67.11	46.71	81.34	56.16
mContriever-Base	86.31	60.37	78.66	36.30	84.44	67.38	95.82	77.42
E5-Base	91.58	66.56	86.61	46.08	91.89	75.90	96.24	77.69
ST-DistilRoBERTa	73.78	48.29	48.43	16.73	84.89	64.39	88.02	63.76
ST-MPNet	76.66	49.99	56.80	21.55	86.00	65.44	87.19	62.99
HerBERT-QA	84.23	54.36	75.84	32.52	85.78	63.58	91.09	66.99
Silver Retriever v1	92.45	66.72	87.24	43.40	94.56	79.66	95.54	77.10
Silver Retriever v1.1	93.18	67.55	88.60	44.88	94.00	79.83	96.94	77.95

Legend:

Acc is the Accuracy at 10
NDCG is the Normalized Discounted Cumulative Gain at 10
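
To make the two metrics concrete, here is a minimal sketch of how they are commonly computed for a single question. It assumes binary relevance (a passage is either relevant or not) and a ranked list without duplicates; the exact evaluation protocol behind the table may differ, and the function names are hypothetical.

```python
import math
from typing import Sequence, Set


def accuracy_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """1.0 if at least one relevant passage appears in the top k, else 0.0."""
    return 1.0 if any(pid in relevant_ids for pid in ranked_ids[:k]) else 0.0


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the best achievable DCG."""
    # Position 0 is rank 1, so the discount is 1 / log2(rank + 1) = 1 / log2(position + 2).
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, pid in enumerate(ranked_ids[:k])
        if pid in relevant_ids
    )
    # The ideal ranking places all relevant passages (up to k) at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: the only relevant passage is retrieved at rank 2.
ranked = ["p3", "p1", "p7"]
relevant = {"p1"}
print(accuracy_at_k(ranked, relevant))  # 1.0
print(ndcg_at_k(ranked, relevant))      # 1 / log2(3), about 0.63
```

Averaging these per-question values over a dataset and multiplying by 100 gives percentages on the same scale as the table.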

Usage

Preparing inputs

The model was trained on question-passage pairs and works best when the input is in the same format as that used during training:

Add the phrase Pytanie: to the beginning of the question.
The training passages consist of title and text concatenated with the special token </s>. Even if your passages don't have a title, it is still beneficial to prefix a passage with the </s> token.
Although the dot product was used during training, the model usually works better with the cosine distance.
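
The formatting rules above can be wrapped in two small helpers. This is a sketch only: `format_question` and `format_passage` are hypothetical names, not functions shipped with the model, and the whitespace handling is an assumption.

```python
QUESTION_PREFIX = "Pytanie: "
TITLE_SEPARATOR = "</s>"


def format_question(question: str) -> str:
    """Prefix a question with "Pytanie: " unless it already has the prefix."""
    question = question.strip()
    if question.startswith(QUESTION_PREFIX.strip()):
        return question
    return QUESTION_PREFIX + question


def format_passage(text: str, title: str = "") -> str:
    """Join title and text with </s>; the separator is kept even without a title."""
    return f"{title.strip()}{TITLE_SEPARATOR}{text.strip()}"


print(format_question("W jakim mieście urodził się Zbigniew Herbert?"))
print(format_passage("Zbigniew Bolesław Ryszard Herbert – polski poeta.", title="Zbigniew Herbert"))
print(format_passage("Tekst bez tytułu."))
```

The formatted strings can then be passed to `model.encode(...)` or to the tokenizer exactly as in the usage examples.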

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

Additional Information

Model Creators

The model was created by Piotr Rybak from the Institute of Computer Science, Polish Academy of Sciences.

This work was supported by the European Regional Development Fund as a part of 2014–2020 Smart Growth Operational Programme, CLARIN (Common Language Resources and Technology Infrastructure), project no. POIR.04.02.00-00C002/19.

Licensing Information

CC BY-SA 4.0

Citation Information

@inproceedings{rybak-ogrodniczuk-2024-silver-retriever,
    title = "Silver Retriever: Advancing Neural Passage Retrieval for {P}olish Question Answering",
    author = "Rybak, Piotr  and
      Ogrodniczuk, Maciej",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1291",
    pages = "14826--14831",
    abstract = "Modern open-domain question answering systems often rely on accurate and efficient retrieval components to find passages containing the facts necessary to answer the question. Recently, neural retrievers have gained popularity over lexical alternatives due to their superior performance. However, most of the work concerns popular languages such as English or Chinese. For others, such as Polish, few models are available. In this work, we present Silver Retriever, a neural retriever for Polish trained on a diverse collection of manually or weakly labeled datasets. Silver Retriever achieves much better results than other Polish models and is competitive with larger multilingual models. Together with the model, we open-source five new passage retrieval datasets.",
}
