Open-source Model of dense_encoder-msmarco-distilbert-word2vec256k - Free Deployment to Assist Sentence Similarity Tasks

Dense Encoder Msmarco Distilbert Word2vec256k

Developed by vocab-transformers

A sentence encoder based on msmarco-word2vec256000-distilbert-base-uncased, with a 256k vocabulary initialized from word2vec, designed for sentence similarity tasks.

Text Embedding

Transformers

Release Time: 3/2/2022

Model Overview

This model is a sentence transformer primarily used for feature extraction and sentence similarity calculation. It was trained on the MS MARCO dataset using MarginMSELoss and is suitable for scenarios like information retrieval.

Model Features

Word2vec-initialized vocabulary

Uses a 256k vocabulary initialized with word2vec, potentially providing better word vector representations

Frozen word embeddings during training

The word embedding matrix is frozen during training to preserve the characteristics of pre-trained word vectors

MarginMSELoss training

Trained with MarginMSELoss, which fits the model's score margin between a relevant and a non-relevant passage for a query to a target margin (typically produced by a teacher model)
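
As a rough illustration of the MarginMSELoss objective named above, the following sketch computes the loss for a batch of (query, positive, negative) triples. It assumes dot-product scoring and one target margin per triple; the function name `margin_mse_loss` and the toy tensors are hypothetical and are not taken from the model's training script.

```python
import torch
import torch.nn.functional as F


def margin_mse_loss(query_emb, pos_emb, neg_emb, teacher_margin):
    """MSE between the student's score margin and the target margin.

    Embedding tensors have shape (batch, dim); teacher_margin has shape (batch,).
    Dot-product scoring is an assumption of this sketch.
    """
    score_pos = (query_emb * pos_emb).sum(dim=-1)  # student score for the positive passage
    score_neg = (query_emb * neg_emb).sum(dim=-1)  # student score for the negative passage
    return F.mse_loss(score_pos - score_neg, teacher_margin)


# Toy example with made-up 2-dimensional embeddings
query = torch.tensor([[1.0, 0.0]])
positive = torch.tensor([[2.0, 0.0]])
negative = torch.tensor([[1.0, 0.0]])
teacher_margin = torch.tensor([3.0])  # hypothetical target margin

loss = margin_mse_loss(query, positive, negative, teacher_margin)
print(loss.item())  # student margin is 1.0, so the squared error is (1 - 3)^2 = 4.0
```

Because only the difference between the two scores is fitted, the absolute scale of the scores is left free; the loss is zero whenever the student's margin equals the target margin.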

Model Capabilities

Sentence feature extraction

Calculate sentence similarity

Information retrieval

Use Cases

Information retrieval

Document retrieval

Can be used to build search engines that return relevant results based on semantic similarity between queries and documents

Question answering systems

Can be used to match user questions with candidate answers in a knowledge base

Semantic matching

Duplicate question detection

Identify differently phrased but semantically similar questions
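
The retrieval and matching use cases above all reduce to scoring embeddings against each other. A minimal sketch, assuming cosine similarity as the scoring function (the source does not state which score suits this model best) and using made-up vectors in place of real model output; `rank_by_cosine` is a hypothetical helper:

```python
import numpy as np


def rank_by_cosine(query_emb, doc_embs):
    """Return document indices sorted from most to least similar, plus the scores."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q  # cosine similarity of each document with the query
    order = np.argsort(-scores)  # descending by score
    return order, scores


# Made-up 3-dimensional vectors standing in for sentence embeddings
query_emb = np.array([1.0, 0.0, 0.0])
doc_embs = np.array([
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
])

order, scores = rank_by_cosine(query_emb, doc_embs)
print(order.tolist())  # [1, 2, 0]
```

For duplicate question detection, the same scores could be compared against a threshold tuned on labeled pairs rather than sorted.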

🚀 dense_encoder-msmarco-distilbert-word2vec256k

This model is designed for sentence similarity tasks, leveraging a pre-trained architecture with a 256k-sized vocabulary initialized by word2vec to provide high-quality sentence embeddings.

🚀 Quick Start

This model is based on msmarco-word2vec256000-distilbert-base-uncased with a 256k-sized vocabulary initialized with word2vec. It has been trained on MS MARCO using MarginMSELoss. See train_script.py in this repository.

Performance

MS MARCO dev: - (MRR@10)
TREC-DL 2019: 65.53 (nDCG@10)
TREC-DL 2020: 67.42 (nDCG@10)
Avg. on 4 BEIR datasets: 38.97
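
For readers unfamiliar with the two metrics named above, here is a small per-query sketch. It assumes linear gain with a log2 discount for nDCG and that the relevance list covers every judged document for the query; the evaluation code behind the reported numbers may differ in these details. The reported values are presumably per-query means scaled to 0-100.

```python
import math


def mrr_at_k(ranked_relevance, k=10):
    """Reciprocal rank of the first relevant result within the top k (0.0 if none)."""
    for rank, rel in enumerate(ranked_relevance[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_relevance, k=10):
    """nDCG@k with linear gain and a log2(rank + 1) discount."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))

    # Ideal ordering: the same relevance labels sorted from highest to lowest
    ideal = dcg(sorted(ranked_relevance, reverse=True)[:k])
    return dcg(ranked_relevance[:k]) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels of the results, in the order the system returned them
ranked = [0, 2, 1]
print(mrr_at_k(ranked))   # 0.5, the first relevant result is at rank 2
print(ndcg_at_k(ranked))  # below 1.0 because the best document is not ranked first
```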

The word embedding matrix was frozen during training.

✨ Features

Based on a pre-trained model with a 256k word2vec-initialized vocabulary.
Trained on MS MARCO using MarginMSELoss.
Frozen word embedding matrix during training.
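
Freezing a word embedding matrix, as listed above, usually means excluding it from gradient updates. The sketch below shows one common way to do this in PyTorch on a tiny stand-in model; `TinyEncoder` is hypothetical, and how the authors froze the matrix is not stated here. With a Hugging Face model, the embedding table can usually be obtained through `get_input_embeddings()`.

```python
import torch
from torch import nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a transformer: an embedding table plus one linear layer."""

    def __init__(self, vocab_size=10, dim=4):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):
        return self.proj(self.word_embeddings(token_ids)).mean(dim=1)


torch.manual_seed(0)
model = TinyEncoder()

# Freeze the embedding table so no gradient is computed for it
model.word_embeddings.weight.requires_grad_(False)
embeddings_before = model.word_embeddings.weight.clone()
proj_before = model.proj.weight.clone()

# Only parameters that still require gradients are handed to the optimizer
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-5, weight_decay=0.01
)

loss = model(torch.tensor([[1, 2, 3]])).pow(2).sum()
loss.backward()
optimizer.step()

embeddings_unchanged = torch.equal(embeddings_before, model.word_embeddings.weight)
proj_changed = not torch.equal(proj_before, model.proj.weight)
print(embeddings_unchanged, proj_changed)
```

After one optimizer step the embedding table is bit-for-bit identical while the remaining layer has moved, which is the behavior the frozen-embedding setup relies on.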

📦 Installation

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage (Sentence-Transformers)

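from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# '{MODEL_NAME}' is a placeholder; replace it with this model's identifier on the HuggingFace Hub
model = SentenceTransformer('{MODEL_NAME}')
embeddings = model.encode(sentences)  # one embedding vector per sentence
print(embeddings)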

Advanced Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
# ('{MODEL_NAME}' is a placeholder; replace it with this model's identifier on the Hub)
tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
model = AutoModel.from_pretrained('{MODEL_NAME}')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
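
To see why the attention mask matters in the pooling step, the following toy example applies the same mean pooling to a hand-made tensor with one padding position. The values are made up and no model is loaded; the function is repeated so the snippet runs on its own.

```python
import torch


def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)


# One "sentence" with two real tokens and one padding token (made-up values)
token_embeddings = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])

pooled = mean_pooling((token_embeddings,), attention_mask)
naive = token_embeddings.mean(dim=1)  # ignores the mask, so the padding vector leaks in

print(pooled)  # tensor([[2., 3.]])
print(naive)
```

The masked version averages only the two real tokens, while the naive mean is dominated by the padding vector. This is why padded batches need the mask-aware form.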

📚 Documentation

Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the parameters:

DataLoader: torch.utils.data.dataloader.DataLoader of length 7858 with parameters:

{'batch_size': 64, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss: sentence_transformers.losses.MarginMSELoss.MarginMSELoss

Parameters of the fit() method:

{
    "epochs": 30,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'transformers.optimization.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 1000,
    "weight_decay": 0.01
}
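
The numbers above imply a concrete training length and learning-rate curve. The sketch below works them out, assuming that a null `steps_per_epoch` means one pass over the DataLoader per epoch and that "WarmupLinear" denotes linear warmup followed by linear decay to zero; `warmup_linear_lr` is a hypothetical helper, not part of any library.

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=1000, total_steps=7858 * 30):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


steps_per_epoch = 7858  # DataLoader length from the listing above
total_steps = steps_per_epoch * 30  # 30 epochs
max_examples_per_epoch = steps_per_epoch * 64  # upper bound; the last batch may be smaller

print(total_steps)                    # 235740
print(max_examples_per_epoch)         # 502912
print(warmup_linear_lr(500))          # halfway through warmup
print(warmup_linear_lr(1000))         # peak learning rate
print(warmup_linear_lr(total_steps))  # 0.0
```

Under these assumptions the warmup covers well under 1% of the roughly 236k optimizer steps, so almost all of training runs on the decaying part of the schedule.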

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 250, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

Citing & Authors
