MSMarco-BERT Base Dot V5 Fine-Tuned AI Open-Source Semantic Search Model - Empowering Precise and Efficient Information Retrieval


Developed by Adel-Elwan

A semantic search model based on the BERT architecture, optimized for information retrieval systems, that maps text to a 768-dimensional vector space.

Text Embedding

Transformers

English

#Semantic Search Optimization #Specialized for IR Systems #High Recall Rate

Downloads 18

Release Time: 7/24/2023

Model Overview

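This model is a semantic embedding model built on the sentence-transformers framework. Its name and the dataset listed in the Information Table indicate that it starts from the MS MARCO-trained msmarco-bert-base-dot-v5 checkpoint and is further fine-tuned on Adel-Elwan/Artificial-intelligence-dataset-for-IR-systems. It is suitable for tasks such as sentence similarity calculation, semantic search, and information retrieval.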

Model Features

Efficient Semantic Encoding

Capable of efficiently encoding sentences and paragraphs into 768-dimensional dense vectors while preserving semantic information

Fine-tuning Optimization

Fine-tuned from an MS MARCO-trained checkpoint, making it particularly suitable for information retrieval scenarios

Multi-task Support

Supports various downstream tasks such as clustering and semantic search

Model Capabilities

Text Vectorization

Semantic Similarity Calculation

Information Retrieval

Text Clustering

Use Cases

Information Retrieval

Document Search System

Building a semantic-based document retrieval system

Top-5 accuracy 83.45%, Top-10 accuracy 87.78%

Q&A System

Used for question-answer matching in Q&A systems

MRR@10 reaches 0.7327

Content Recommendation

🚀 Adel-Elwan/msmarco-bert-base-dot-v5-fine-tuned-AI

This model is based on sentence-transformers. It maps sentences and paragraphs to a 768-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.

🚀 Quick Start

📦 Installation

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('Adel-Elwan/msmarco-bert-base-dot-v5-fine-tuned-AI')
embeddings = model.encode(sentences)
print(embeddings)
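
Because the training configuration listed further down uses dot_score as its similarity function, ranking documents by the dot product between query and document embeddings is a natural fit for semantic search with this model. The following is a minimal sketch of that ranking step in plain Python. The 3-dimensional vectors are made-up stand-ins for the 768-dimensional output of model.encode(...), and rank_by_dot is a hypothetical helper, not part of sentence-transformers.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_dot(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, score) pairs sorted by descending dot product."""
    scored = [(i, dot(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (hypothetical values).
query = [1.0, 0.0, 2.0]
docs = [
    [0.5, 0.1, 0.0],
    [0.9, 0.0, 1.5],
    [0.0, 2.0, 0.2],
]

ranking = rank_by_dot(query, docs, top_k=2)
print(ranking)  # document 1 ranks first, then document 0
```

With real embeddings, the same idea applies to the arrays returned by model.encode for the query and for the document collection.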

Advanced Usage

Without sentence-transformers, you can use the model through Hugging Face transformers directly: first pass your input through the transformer model, then apply the appropriate pooling operation (mean pooling for this model) on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('Adel-Elwan/msmarco-bert-base-dot-v5-fine-tuned-AI')
model = AutoModel.from_pretrained('Adel-Elwan/msmarco-bert-base-dot-v5-fine-tuned-AI')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
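
To make the role of the attention mask concrete, here is a small dependency-free sketch of the same averaging for a single sentence. It is an illustration only: mean_pool is a hypothetical helper, and the 2-dimensional token vectors are invented values. The padding token is given deliberately large values to show that it does not affect the result.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors where attention_mask == 1; padding is ignored."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                summed[j] += value
    # Mirrors the clamp in the torch version: avoids division by zero
    # if every position is masked out.
    denom = max(count, 1e-9)
    return [s / denom for s in summed]


# Two real tokens followed by one padding token (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```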

📚 Documentation

Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the following parameters:

DataLoader: torch.utils.data.dataloader.DataLoader of length 6563 with parameters:

{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss: sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:

{'scale': 20.0, 'similarity_fct': 'dot_score'}
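
As a rough illustration of what this loss computes: for a batch of (query, passage) pairs, each query is scored against every passage in the batch, the scores are multiplied by the scale, and a cross-entropy loss treats the query's own passage as the correct class and all other passages as negatives. The sketch below is a plain-Python approximation of that idea using the dot product and scale 20.0 from the configuration above. It is not the library's implementation, and mnr_loss and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mnr_loss(query_vecs, passage_vecs, scale=20.0):
    """Mean cross-entropy where passage i is the positive for query i
    and every other passage in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * dot(q, p) for p in passage_vecs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own passage
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the other passage

loss_aligned = mnr_loss(queries, aligned)
loss_swapped = mnr_loss(queries, swapped)
print(loss_aligned, loss_swapped)  # near zero when aligned, large when swapped
```

One consequence of this design is that larger batches supply more in-batch negatives per query; with the batch size of 16 listed above, each query is contrasted against 15 other passages.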

Parameters of the fit() method:

{
    "epochs": 1,
    "evaluation_steps": 5000,
    "evaluator": "sentence_transformers.evaluation.InformationRetrievalEvaluator.InformationRetrievalEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'transformers.optimization.AdamW'>",
    "optimizer_params": {
        "correct_bias": false,
        "eps": 1e-06,
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 656,
    "weight_decay": 0.01
}
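
The "WarmupLinear" scheduler raises the learning rate linearly from zero to its peak over the warmup steps and then decays it linearly back to zero by the end of training. The sketch below reproduces that shape for the values listed above. The total step count of 6563 is an assumption (DataLoader length times 1 epoch), and warmup_linear_factor is a hypothetical helper written for illustration.

```python
def warmup_linear_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


BASE_LR = 2e-05
WARMUP_STEPS = 656
TOTAL_STEPS = 6563 * 1  # assumed: DataLoader length x epochs

# Learning rate at the start, mid-warmup, the peak, and the final step.
for step in (0, 328, 656, 6563):
    lr = BASE_LR * warmup_linear_factor(step, WARMUP_STEPS, TOTAL_STEPS)
    print(step, lr)
```

Under that assumption, the 656 warmup steps cover roughly the first 10% of training.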

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

Information Table

Property	Details
Pipeline Tag	question-answering
Tags	semantic-search, sentence-similarity, sentence-transformers, transformers, artificial-intelligence, computer-science
Language	en
Metrics	accuracy
Datasets	Adel-Elwan/Artificial-intelligence-dataset-for-IR-systems

Model Index

Task Type	Task Name	Dataset Type	Dataset Name	Split	Metrics
semantic-search	Semantic Search	Adel-Elwan/Artificial-intelligence-dataset-for-IR-systems	Artificial intelligence dataset for IR systems	test	Accuracy@5: 83.45%, Accuracy@10: 87.78%, Precision@5: 16.69%, Recall@5: 83.45%, Recall@10: 87.78%, MRR@10: 0.7327 (verified: true)
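
For readers unfamiliar with these metrics, the sketch below shows how Accuracy@k, Precision@k, Recall@k, and MRR@k are commonly defined for ranked retrieval results. The document IDs and relevance judgments are invented, and the helper functions are hypothetical rather than taken from the evaluator used in training. As an aside, the reported Precision@5 (16.69%) is one fifth of Recall@5 (83.45%), which is what these definitions produce when each query has exactly one relevant document; the source does not state this, so treat it as an inference.

```python
def hits_in_top_k(ranked_ids, relevant_ids, k):
    """Number of relevant documents among the first k results."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)


def accuracy_at_k(ranked_ids, relevant_ids, k):
    """1.0 if at least one relevant document is in the top k, else 0.0."""
    return 1.0 if hits_in_top_k(ranked_ids, relevant_ids, k) > 0 else 0.0


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k slots filled by relevant documents."""
    return hits_in_top_k(ranked_ids, relevant_ids, k) / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents found in the top k."""
    return hits_in_top_k(ranked_ids, relevant_ids, k) / len(relevant_ids)


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant document in the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean(values):
    return sum(values) / len(values)


# Hypothetical ranked results and relevance judgments for two queries.
runs = [
    (["d3", "d1", "d7"], {"d1"}),  # relevant document at rank 2
    (["d2", "d9", "d4"], {"d5"}),  # relevant document not retrieved
]

summary = {
    "accuracy@5": mean([accuracy_at_k(r, rel, 5) for r, rel in runs]),
    "precision@5": mean([precision_at_k(r, rel, 5) for r, rel in runs]),
    "recall@5": mean([recall_at_k(r, rel, 5) for r, rel in runs]),
    "mrr@10": mean([reciprocal_rank_at_k(r, rel, 10) for r, rel in runs]),
}
print(summary)
```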

📄 License

No license information is provided for this model.

Citing & Authors

No detailed citation or author information is provided.
