🚀 MMLW-retrieval-e5-large
MMLW (muszę mieć lepszą wiadomość, "I must have a better message") are neural text encoders for Polish. This model is optimized for information retrieval tasks and transforms queries and passages into 1024-dimensional vectors.
🚀 Quick Start
The model was developed using a two-step procedure:
- In the first step, it was initialized from a multilingual E5 checkpoint and then trained with a multilingual knowledge distillation method on a diverse corpus of 60 million Polish-English text pairs. The English FlagEmbeddings (BGE) models were used as teachers for the distillation (a minimal sketch of this objective is shown after this list).
- In the second step, the resulting models were fine-tuned with a contrastive loss on the Polish MS MARCO training split. To make contrastive training more effective, large batch sizes were used: 1152 for small, 768 for base, and 288 for large models. Fine-tuning was conducted on a cluster of 12 A100 GPUs.
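The distillation step follows the general recipe of transferring a monolingual teacher's embedding space to a multilingual student: the student is trained to reproduce the teacher's English embeddings for both sides of each translation pair. The snippet below is only an illustrative sketch of that objective, not the actual MMLW training code; `teacher` and `student` are hypothetical encoder callables.

```python
# Minimal sketch of a multilingual knowledge distillation objective:
# align the student's EN and PL embeddings with the frozen teacher's EN embeddings.
# `teacher` and `student` are hypothetical callables mapping a list of texts
# to a (batch, dim) tensor; this is NOT the original MMLW training code.
import torch
import torch.nn.functional as F

def distillation_loss(teacher, student, english_texts, polish_texts):
    with torch.no_grad():
        target = teacher(english_texts)                    # frozen teacher embeddings
    loss_en = F.mse_loss(student(english_texts), target)   # student matches teacher on EN
    loss_pl = F.mse_loss(student(polish_texts), target)    # and transfers it to PL
    return loss_en + loss_pl
```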
⚠️ Important Note
2023-12-26: We have updated the model to a new version with improved results. You can still download the previous version using the v1 tag: `AutoModel.from_pretrained("sdadas/mmlw-retrieval-e5-large", revision="v1")`
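If you work through sentence-transformers instead, recent releases expose a revision argument, so the previous weights can be loaded the same way (assuming your installed version supports it):

```python
from sentence_transformers import SentenceTransformer

# Load the previous (v1) weights; requires a sentence-transformers release
# that supports the `revision` argument, otherwise use the AutoModel call above.
model_v1 = SentenceTransformer("sdadas/mmlw-retrieval-e5-large", revision="v1")
```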
✨ Features
- Optimized for Information Retrieval: Specifically designed for information retrieval tasks.
- High-Dimensional Vector Transformation: Transforms queries and passages into 1024-dimensional vectors.
📦 Installation
The usage examples below rely on the sentence-transformers library, which can be installed with `pip install -U sentence-transformers`.
💻 Usage Examples
Basic Usage
⚠️ Our dense retrievers require the use of specific prefixes and suffixes when encoding texts. For this model, queries should be prefixed with "query: " and passages with "passage: ".
You can use the model like this with sentence-transformers:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# The model requires specific prefixes: "query: " for queries, "passage: " for passages.
query_prefix = "query: "
answer_prefix = "passage: "
queries = [query_prefix + "Jak dożyć 100 lat?"]
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
model = SentenceTransformer("sdadas/mmlw-retrieval-e5-large")

# Encode queries and passages into 1024-dimensional vectors.
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

# Pick the passage with the highest cosine similarity to the query.
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
```
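For larger passage collections, the same embeddings can be reused for top-k retrieval, for example with the semantic_search helper from sentence-transformers. The corpus below is only illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sdadas/mmlw-retrieval-e5-large")

# Remember the prefixes: "query: " for queries, "passage: " for passages.
corpus = [
    "passage: Trzeba zdrowo się odżywiać i uprawiać sport.",
    "passage: Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True, show_progress_bar=False)
query_emb = model.encode(["query: Jak dożyć 100 lat?"], convert_to_tensor=True, show_progress_bar=False)

# semantic_search ranks corpus entries by cosine similarity to each query.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```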
📚 Documentation
Evaluation Results
The model achieves NDCG@10 of 58.30 on the Polish Information Retrieval Benchmark. See PIRB Leaderboard for detailed results.
Acknowledgements
This model was trained with the A100 GPU cluster support delivered by the Gdansk University of Technology within the TASK center initiative.
Citation
@article{dadas2024pirb,
title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
year={2024},
eprint={2402.13350},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
📄 License
The model is licensed under the Apache 2.0 license.
| Property | Details |
|----------|---------|
| Pipeline Tag | sentence-similarity |
| Tags | sentence-transformers, feature-extraction, sentence-similarity, transformers, information-retrieval |
| Language | pl |
| License | apache-2.0 |