Dense Encoder Model - Open-Source Solution for Semantic Search and Sentence Similarity Tasks, Precise and Practical!

Dense Encoder Msmarco Distilbert Word2vec256k MLM 445k Emb Updated

Developed by vocab-transformers

A sentence embedding model trained on the MS MARCO dataset, using a word2vec-initialized 256k vocabulary and DistilBERT architecture, suitable for semantic search and sentence similarity tasks

Text Embedding

Transformers

Release Time: 3/2/2022

Model Overview

This sentence embedding model maps sentences and paragraphs into a 768-dimensional dense vector space, suitable for natural language processing tasks such as clustering and semantic search.

Model Features

Word2Vec Initialized Vocabulary

Uses a 256k vocabulary initialized with word2vec, enhancing the quality of word embeddings

MS MARCO Dataset Training

Trained on the MS MARCO dataset using MarginMSELoss, optimizing semantic search capabilities

High-Performance Sentence Embeddings

Achieved nDCG@10 scores of 66.72 and 69.14 on TREC-DL 2019 and 2020, respectively

Model Capabilities

Sentence Embedding

Semantic Search

Text Clustering

Information Retrieval

Use Cases

Information Retrieval

Document Retrieval System

Build an efficient document retrieval system that matches relevant documents based on query semantics

Achieved an MRR@10 of 34.94 on the MS MARCO development set

Question Answering System

Question Matching

Match similar questions in a question-answering system

🚀 dense_encoder-msmarco-distilbert-word2vec256k-MLM_445k

This model is designed for sentence similarity tasks. It maps sentences and paragraphs into a 768-dimensional dense vector space that can be used for clustering, semantic search, and related tasks.

🚀 Quick Start

Prerequisites

You need to install the sentence-transformers library. You can install it using the following command:

pip install -U sentence-transformers

Basic Usage

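from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Replace '{MODEL_NAME}' with this model's identifier on the Hugging Face Hub
model = SentenceTransformer('{MODEL_NAME}')
embeddings = model.encode(sentences)  # one 768-dimensional vector per sentence
print(embeddings)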
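
Once sentences are encoded, semantic search comes down to scoring a query vector against document vectors and sorting. The sketch below does this in pure Python so that it runs without downloading the model. The toy 3-dimensional vectors stand in for the 768-dimensional output of model.encode, and the helper names (dot, cosine, rank_documents) are illustrative, not part of sentence-transformers. The card does not say whether dot product or cosine similarity suits this model better, so both are provided and the choice is left as an assumption to validate on your own data.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm > 0 else 0.0

def rank_documents(query_emb, doc_embs, score_fn=dot, top_k=3):
    """Return (doc_index, score) pairs sorted by descending score."""
    scored = [(i, score_fn(query_emb, d)) for i, d in enumerate(doc_embs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for model.encode(...) output
# (assumption: real embeddings from this model have 768 dimensions).
query = [1.0, 0.0, 1.0]
docs = [
    [0.9, 0.1, 0.8],  # close to the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.5, 0.5, 0.5],  # partially related
]

ranking = rank_documents(query, docs, score_fn=cosine)
print([index for index, _ in ranking])  # [0, 2, 1]
```

With real embeddings, pass the query vector and the list of document vectors returned by model.encode in place of the toy data.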

✨ Features

Vocabulary: Built on vocab-transformers/msmarco-distilbert-word2vec256k-MLM_445k, which has a 256k vocabulary initialized with word2vec and was trained with MLM for 445k steps.
Training: Trained on MS MARCO using MarginMSELoss.
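
For readers unfamiliar with MarginMSELoss, the sketch below shows the idea as it is commonly described: for each (query, positive, negative) triplet, the student's score margin is pushed toward a margin supplied by a teacher model, using mean squared error. This is a pure-Python illustration, not the sentence-transformers implementation. The function name margin_mse, the use of dot-product scores, and the toy vectors and teacher margins are all assumptions made for the example.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def margin_mse(queries, positives, negatives, teacher_margins):
    """Mean squared error between student and teacher score margins.

    Student margin per triplet: score(q, pos) - score(q, neg),
    with score taken to be the dot product (assumption).
    """
    errors = []
    for q, pos, neg, target in zip(queries, positives, negatives, teacher_margins):
        student_margin = dot(q, pos) - dot(q, neg)
        errors.append((student_margin - target) ** 2)
    return sum(errors) / len(errors)

# Toy 2-dimensional embeddings (hypothetical values).
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[2.0, 0.0], [0.0, 3.0]]
negatives = [[0.5, 1.0], [1.0, 1.0]]
teacher_margins = [1.5, 4.0]  # hypothetical margins from a teacher model

loss = margin_mse(queries, positives, negatives, teacher_margins)
print(loss)  # 2.0
```

In the example the first triplet already matches the teacher margin (1.5), while the second is off by 2, giving a mean squared error of 2.0.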

💻 Usage Examples

Advanced Usage (Without sentence-transformers)

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from the Hugging Face Hub (replace '{MODEL_NAME}' with this model's identifier)
tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
model = AutoModel.from_pretrained('{MODEL_NAME}')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)

📚 Documentation

Performance

MS MARCO dev: 34.94 (MRR@10)
TREC-DL 2019: 66.72 (nDCG@10)
TREC-DL 2020: 69.14 (nDCG@10)
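
The scores above are on a 0 to 100 scale, so 34.94 corresponds to an MRR@10 of 0.3494. As a reminder of what the two metrics measure, here is a minimal sketch of both for a single query, averaged over queries in practice. It assumes the linear-gain form of nDCG with a log2 discount. Some tools use an exponential gain instead, and the card does not state which evaluation tool produced its numbers. The function names and relevance labels are illustrative.

```python
import math

def mrr_at_k(ranked_relevance, k=10):
    """Reciprocal rank of the first relevant result in the top k, else 0."""
    for rank, rel in enumerate(ranked_relevance[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked_gains, judged_gains, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering.

    ranked_gains: relevance grades of the returned results, in rank order.
    judged_gains: relevance grades of all judged documents for the query.
    """
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical binary labels for three queries (1 = relevant).
runs = [[0, 1, 0], [1, 0], [0, 0]]
mean_mrr = sum(mrr_at_k(r) for r in runs) / len(runs)
print(mean_mrr)  # 0.5

# Hypothetical graded labels: the system missed the grade-1 document
# and ranked a non-relevant one second.
ndcg = ndcg_at_k(ranked_gains=[3, 0, 2], judged_gains=[3, 2, 1])
print(round(ndcg, 4))
```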

Evaluation

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the following parameters:

DataLoader: torch.utils.data.dataloader.DataLoader of length 7858 with parameters:

{'batch_size': 64, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss: sentence_transformers.losses.MarginMSELoss.MarginMSELoss

Parameters of the fit()-Method:

{
    "epochs": 30,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'transformers.optimization.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 1000,
    "weight_decay": 0.01
}
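
To make the "WarmupLinear" setting concrete, the sketch below computes the learning rate at a given step for a linear warmup followed by a linear decay to zero. It assumes the total number of steps is the DataLoader length times the number of epochs (7858 × 30 = 235,740), which is how the schedule would work out if steps_per_epoch falls back to the DataLoader length. The function name warmup_linear_lr is illustrative and the code is not taken from the training script.

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=1000, total_steps=7858 * 30):
    """Learning rate under linear warmup, then linear decay to zero.

    Assumption: total_steps = DataLoader length * epochs.
    """
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 500, 1000, 118370, 235740):
    print(step, warmup_linear_lr(step))
```

The learning rate peaks at 2e-05 at step 1000 and reaches zero at the final step.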

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 250, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
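
The Pooling module above uses mean pooling (pooling_mode_mean_tokens is True): the sentence embedding is the average of the token embeddings, with padding positions excluded. The pure-Python sketch below illustrates that averaging for one sentence with toy 2-dimensional vectors. The function name masked_mean_pool is illustrative and the code is not the library's implementation.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1; padding tokens are ignored."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vector[j]
    count = max(count, 1)  # avoid division by zero for an all-padding input
    return [value / count for value in total]

# Two real tokens followed by one padding token (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector has no effect on the result, which is why the attention mask has to be passed to the pooling step.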

Important Note

The token embeddings were updated during training.
