Bi-encoder-russian-msmarco Open-source Model - A Free Tool for Asymmetric Semantic Search in Russian

Bi Encoder Russian Msmarco

Developed by DiTy

A sentence-transformers model fine-tuned on the MS-MARCO Russian passage ranking dataset, based on the DeepPavlov/rubert-base-cased pre-trained model, designed for asymmetric semantic search in Russian.

Text Embedding

Transformers

Open Source License: MIT #Russian semantic search #High-precision retrieval #Medical text analysis

Downloads 74.33k

Release Time: 4/16/2024

Model Overview

This model maps sentences and paragraphs into a 768-dimensional dense vector space. It is intended primarily for asymmetric semantic search in Russian, where a short query is matched against longer passages, and it also supports sentence similarity computation.

Model Features

Efficient semantic search

Capable of quickly computing semantic similarity between Russian sentences, suitable for large-scale document retrieval scenarios.

Asymmetric search capability

Supports similarity comparison between query sentences and long paragraphs, ideal for applications like Q&A systems.

High-precision retrieval

Reports a recall@5 of 0.9997 on the mMARCO Russian test set.
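
For readers unfamiliar with the metric, recall@k is the share of a query's relevant passages that appear among the top k retrieved results, averaged over all queries. The sketch below is a minimal illustration in plain Python, not the evaluation code behind the reported figure; the function name `recall_at_k` and the toy document IDs are assumptions made for the example.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


# Toy data (hypothetical): ranked results per query and the relevant passage for each.
rankings = {
    "q1": ["d3", "d7", "d1", "d9", "d4", "d2"],  # relevant passage at rank 3
    "q2": ["d5", "d8", "d6", "d0", "d2", "d1"],  # relevant passage at rank 6
}
relevant = {"q1": {"d1"}, "q2": {"d1"}}

scores = [recall_at_k(rankings[q], relevant[q], k=5) for q in rankings]
mean_recall = sum(scores) / len(scores)
print(mean_recall)  # 0.5: q1 is a hit within the top 5, q2 is not
```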

Model Capabilities

Russian text feature extraction

Sentence similarity computation

Semantic search

Document retrieval

Use Cases

Information retrieval

Medical Q&A system

Matching user medical questions with professional answers in a knowledge base

Accurately finds relevant medical explanations

Legal document retrieval

Retrieving relevant legal clauses based on short queries

Quickly locates relevant legal provisions

Content recommendation

News article recommendation

Recommending similar news based on user reading history

Enhances user reading experience

🚀 DiTy/bi-encoder-russian-msmarco

This project presents a sentence-transformers model. It is based on the pre-trained DeepPavlov/rubert-base-cased and fine-tuned with the MS-MARCO Russian passage ranking dataset. It maps sentences and paragraphs into a 768-dimensional dense vector space, enabling asymmetric semantic search in the Russian language.

✨ Features

Based on the pre-trained DeepPavlov/rubert-base-cased model.
Fine-tuned on the unicamp-dl/mmarco dataset.
Capable of performing asymmetric semantic search in Russian.

📦 Installation

To use this model, you need to install the sentence-transformers library:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

If you have installed the sentence-transformers library, you can use the model as follows:

from sentence_transformers import SentenceTransformer, util

sentences = [
    'какое состояние может определить тест с физической нагрузкой', 
    'Тест с физической нагрузкой разработан, чтобы выяснить, содержат ли одна или несколько коронарных артерий, питающих сердце, жировые отложения (бляшки), которые блокируют кровеносный сосуд на 70% или более. Для подтверждения результата часто требуется дополнительное тестирование. Результат испытаний.',
    'Тест направлен на то, чтобы выяснить, не получает ли какой-либо участок сердечной мышцы достаточный кровоток во время тренировки. Он похож на тест с физической нагрузкой, фармакологический или химический стресс-тест. Он также известен при стресс-тесте таллием, сканировании перфузии миокарда или радионуклидном тесте.'
]

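# Load the model from the Hugging Face Hub and embed the query and both passages
model = SentenceTransformer('DiTy/bi-encoder-russian-msmarco')
embeddings = model.encode(sentences)
# Rank the passages (embeddings[1:]) against the query (embeddings[0]) by cosine similarity
results = util.semantic_search(embeddings[0], embeddings[1:])[0]

print(f"Sentence similarity: {results}")
# `Sentence similarity: [{'corpus_id': 0, 'score': 0.8545001149177551}, {'corpus_id': 1, 'score': 0.023047829046845436}]`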
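
Each hit's `corpus_id` indexes into the corpus passed to `util.semantic_search`, which here is `embeddings[1:]`, so `corpus_id` 0 corresponds to `sentences[1]`. A small helper can map hits back to their passages. This is a sketch only: `attach_passages` is a hypothetical name, the hits are copied from the output shown above, and the passage texts are shortened for display.

```python
def attach_passages(hits, passages):
    """Pair each semantic_search hit with the passage text it refers to."""
    return [(passages[hit["corpus_id"]], hit["score"]) for hit in hits]


# Hits copied from the example output; passages are shortened stand-ins for sentences[1:].
hits = [
    {"corpus_id": 0, "score": 0.8545001149177551},
    {"corpus_id": 1, "score": 0.023047829046845436},
]
passages = [
    "Тест с физической нагрузкой разработан...",
    "Тест направлен на то...",
]

for text, score in attach_passages(hits, passages):
    print(f"{score:.4f}  {text}")
```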

Advanced Usage

Even without the sentence-transformers library, you can still use the model. First, pass your input through the transformer model, and then apply the appropriate pooling operation on top of the contextualized word embeddings:

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: average the token embeddings, using the attention mask so padding tokens are ignored
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum over unmasked tokens, then divide by their count (clamped to avoid division by zero)
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = [
  'красный плоский лишай вызван стрессом',
  'В большинстве случаев причину появления красного плоского лишая невозможно. Это не вызвано стрессом, но иногда эмоциональный стресс усугубляет ситуацию. Известно, что это заболевание возникает после контакта с определенными химическими веществами, такими как те, которые используются для проявления цветных фотографий. У некоторых людей определенные лекарства вызывают красный плоский лишай. Эти препараты включают лекарства от высокого кровяного давления, болезней сердца, диабета, артрита и малярии, антибиотики, нестероидные противовоспалительные обезболивающие и т. Д.',
  'К сожалению для работодателей, в разных штатах страны есть несколько дел, по которым суды установили, что стресс, вызванный работой, может быть основанием для увольнения с работы, если стресс достигает уровня серьезного состояния здоровья, которое вызывает они не могут выполнять свою работу.',
]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('DiTy/bi-encoder-russian-msmarco')
model = AutoModel.from_pretrained('DiTy/bi-encoder-russian-msmarco')

# Tokenize sentences
encoded_input = tokenizer(sentences, max_length=512, padding='max_length', truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
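
The advanced example stops at the embeddings. To reproduce the ranking step of the basic example, the query vector can be compared with each passage vector by cosine similarity. The following is a minimal sketch in plain Python, assuming the embeddings have been converted to lists (for instance with `sentence_embeddings.tolist()`). The helper names and the three short vectors are made-up placeholders, not real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return (passage_index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder vectors standing in for sentence_embeddings.tolist()
vectors = [
    [1.0, 0.0, 1.0],  # query
    [2.0, 0.0, 2.0],  # passage 0: same direction as the query
    [0.0, 3.0, 0.0],  # passage 1: orthogonal to the query
]
ranking = rank_passages(vectors[0], vectors[1:])
print(ranking)
```

Because the score depends only on direction, passage 0 ranks first despite its larger magnitude.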

🔧 Technical Details

Training Parameters

The model was trained with the following parameters:

DataLoader: torch.utils.data.dataloader.DataLoader of length 1989041 with parameters:

{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss: sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:

{'scale': 20.0, 'similarity_fct': 'cos_sim'}
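
In rough terms, this loss treats every other passage in the batch as a negative for a given query: the cosine similarities between the query and all passages in the batch are multiplied by `scale`, and cross-entropy is applied with the matching passage as the target. The sketch below illustrates the idea in plain Python on toy 2-D vectors. It is not the library's implementation, and the names `cos_sim` and `mnr_loss` and the vectors are chosen for illustration only.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mnr_loss(queries, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct passage for queries[i]."""
    losses = []
    for i, q in enumerate(queries):
        logits = [scale * cos_sim(q, p) for p in positives]
        # Numerically stable log-sum-exp over all passages in the batch
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each query matches its own passage
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each query matches the other query's passage

loss_aligned = mnr_loss(queries, aligned)
loss_swapped = mnr_loss(queries, swapped)
print(loss_aligned, loss_swapped)
```

With `scale` at 20.0, the aligned batch gives a loss close to zero, while the swapped batch gives a loss close to 20, which shows how strongly the scale factor penalizes ranking an in-batch negative above the true passage.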

Parameters of the fit() method:

{
    "epochs": 5,
    "evaluation_steps": 250000,
    "evaluator": "sentence_transformers.evaluation.InformationRetrievalEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 10000,
    "weight_decay": 0.01
}
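
To make these settings concrete, the sketch below works out the step count and the learning-rate curve they imply. It assumes that a `null` value for `steps_per_epoch` means one full pass over the DataLoader per epoch, and that `WarmupLinear` rises linearly from zero during warmup and then decays linearly to zero at the final step. The function name `warmup_linear_lr` is hypothetical and this is not the library's scheduler code.

```python
steps_per_epoch = 1_989_041  # DataLoader length reported above
epochs = 5
total_steps = steps_per_epoch * epochs  # 9_945_205 optimizer steps


def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=10_000, total=total_steps):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total - step)
    return base_lr * remaining / max(1, total - warmup_steps)


for step in (0, 5_000, 10_000, total_steps // 2, total_steps):
    print(step, warmup_linear_lr(step))
```

Under these assumptions the peak rate of 2e-05 is reached at step 10,000, which is about 0.1% of the way through training.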

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
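
The Pooling layer has only `pooling_mode_mean_tokens` enabled, which matches the `mean_pooling` function in the usage example: token vectors are averaged, and padding positions are excluded through the attention mask. A toy illustration in plain Python follows; the function name `masked_mean` and the values are invented for the example.

```python
def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask value is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    count = max(len(kept), 1)  # guard against an all-zero mask
    return [sum(vec[d] for vec in kept) / count for d in range(dim)]


tokens = [
    [1.0, 2.0],  # real token
    [3.0, 4.0],  # real token
    [9.0, 9.0],  # padding position, must not affect the result
]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```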

📄 License

This project is licensed under the MIT license.
