fiqa-tsdae-msmarco-distilbert-gpl: an open-source model for sentence similarity calculation and semantic search

Fiqa Tsdae Msmarco Distilbert Gpl

Developed by GPL

This is a model based on sentence-transformers that maps sentences and paragraphs into a 768-dimensional dense vector space, suitable for tasks such as sentence similarity calculation and semantic search.

Text Embedding

Transformers

#Sentence Vectorization #Semantic Similarity #Text Clustering

Release Time: 3/2/2022

Model Overview

This model is designed to calculate semantic similarity between sentences and paragraphs. It generates sentence embeddings suitable for applications such as information retrieval and clustering analysis.

Model Features

High-Quality Sentence Embeddings

Capable of generating 768-dimensional high-quality sentence embeddings that effectively capture semantic information.

Semantic Similarity Calculation

Specially optimized for calculating semantic similarity between sentences and paragraphs.

Easy Integration

Can be easily integrated into existing systems via the sentence-transformers library.

Model Capabilities

Sentence Vectorization

Semantic Similarity Calculation

Text Clustering

Semantic Search

Use Cases

Information Retrieval

Semantic Search System

Build a search system based on semantics rather than keywords.

Improves the relevance and accuracy of search results.
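A minimal sketch of the ranking step behind such a search system, assuming embeddings have already been computed (for example with `model.encode`). The 3-dimensional toy vectors and the helper names `cosine` and `rank_by_cosine` are hypothetical and only illustrate the idea; real embeddings from this model have 768 dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_vec, corpus_vecs, top_k=3):
    """Return (corpus_index, score) pairs sorted by descending similarity."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for model.encode(...) output (assumption).
corpus_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.6, 0.1],
]
query_vec = [1.0, 0.0, 0.0]

hits = rank_by_cosine(query_vec, corpus_vecs, top_k=2)
for idx, score in hits:
    print(idx, round(score, 3))
```

In a real system the corpus embeddings would typically be computed once and stored, so that only the query needs to be encoded at search time.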

Text Analysis

Document Clustering

Automatically group documents based on semantic similarity.

Discovers thematic structures within document collections.
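One way to picture grouping by semantic similarity is the simple greedy threshold scheme sketched below. It is an illustrative stand-in, not the method of any particular library: `greedy_cluster`, the threshold of 0.8, and the toy vectors are all assumptions, and standard algorithms such as k-means or agglomerative clustering are more typical in practice.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_cluster(vectors, threshold=0.8):
    """Assign each vector to the first cluster whose seed is similar enough.

    Each cluster is a list of indices; its first element acts as the seed.
    """
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            # No existing seed was close enough: start a new cluster.
            clusters.append([i])
    return clusters


# Toy document embeddings (assumption; real ones would come from model.encode).
doc_vecs = [
    [1.0, 0.0, 0.0],  # doc 0
    [0.9, 0.1, 0.0],  # doc 1, close to doc 0
    [0.0, 1.0, 0.0],  # doc 2
    [0.1, 0.9, 0.1],  # doc 3, close to doc 2
]
clusters = greedy_cluster(doc_vecs, threshold=0.8)
print(clusters)  # [[0, 1], [2, 3]]
```

The result depends on input order and on the chosen threshold, which is the main reason this scheme is only suitable as an illustration.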

🚀 fiqa-tsdae-msmarco-distilbert-gpl

This model, based on sentence-transformers, maps sentences and paragraphs to a 768-dimensional dense vector space. It can be effectively used for tasks such as clustering and semantic search.

🚀 Quick Start

📦 Installation

To use this model, you need to install sentence-transformers first:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

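from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Hub ID assumed from the developer (GPL) and the model name on this page; adjust if it differs
model = SentenceTransformer('GPL/fiqa-tsdae-msmarco-distilbert-gpl')
embeddings = model.encode(sentences)  # one 768-dimensional vector per sentence
print(embeddings)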

Advanced Usage

Even without sentence-transformers, you can still use this model. First, pass your input through the transformer model, and then apply the appropriate pooling operation on the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

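# Load model from HuggingFace Hub
# (Hub ID assumed from the developer, GPL, and the model name on this page; adjust if it differs)
model_id = 'GPL/fiqa-tsdae-msmarco-distilbert-gpl'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)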

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
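To see why the attention mask matters in the pooling step, the following dependency-free sketch mirrors the same masked average on a tiny hand-made input. The function name `mean_pool` and the toy values are illustrative assumptions; the third token plays the role of padding and must not influence the result.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1; padding positions are ignored."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for d in range(dim):
                summed[d] += vec[d]
    # Mirrors torch.clamp(..., min=1e-9): avoids division by zero for an all-padding input.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


# Two real tokens followed by one padding token (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding vector would be averaged in and shift the sentence embedding, which is why the tokenizer's `attention_mask` is passed to the pooling function.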

📚 Documentation

🔍 Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

🔧 Technical Details

Training

The model was trained with the following parameters:

DataLoader: torch.utils.data.dataloader.DataLoader of length 140000 with parameters:

{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.SequentialSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss: gpl.toolkit.loss.MarginDistillationLoss
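The loss name points to margin-based distillation, in which a student embedding model is trained to reproduce the score margin between a positive and a negative passage assigned by a teacher model. The sketch below illustrates that general idea only; it is not the code of `gpl.toolkit.loss.MarginDistillationLoss`, and the dot-product scoring, the function name `margin_mse_loss`, and the toy values are assumptions.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def margin_mse_loss(queries, positives, negatives, teacher_margins):
    """Mean squared error between student and teacher score margins.

    Student margin for one triple: score(query, positive) - score(query, negative).
    """
    errors = []
    for q, p, n, target in zip(queries, positives, negatives, teacher_margins):
        student_margin = dot(q, p) - dot(q, n)
        errors.append((student_margin - target) ** 2)
    return sum(errors) / len(errors)


# Toy embeddings for two (query, positive, negative) triples (assumption).
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[2.0, 0.0], [0.0, 3.0]]
negatives = [[1.0, 1.0], [1.0, 1.0]]
teacher_margins = [1.0, 4.0]  # margins a teacher model might have assigned

loss = margin_mse_loss(queries, positives, negatives, teacher_margins)
print(loss)  # 2.0
```

In this toy case the first triple matches the teacher margin exactly and the second is off by 2, giving a mean squared error of 2.0.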

Parameters of the fit()-Method:

{
    "epochs": 1,
    "evaluation_steps": 0,
    "evaluator": "NoneType",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'transformers.optimization.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": 140000,
    "warmup_steps": 1000,
    "weight_decay": 0.01
}
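The learning-rate schedule implied by these parameters can be sketched as follows, assuming `WarmupLinear` means a linear ramp from 0 to the base rate over the warmup steps, followed by a linear decay to 0 at the final step. The function name `warmup_linear_lr` is illustrative; the defaults are taken from the configuration above (lr 2e-05, 1000 warmup steps, 1 epoch of 140000 steps).

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=1000, total_steps=140000):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 500, 1000, 70500, 140000):
    print(step, warmup_linear_lr(step))
```

Under this assumption the rate peaks at 2e-05 at step 1000 and is halved again by step 70500, the midpoint of the decay phase.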

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 350, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

📄 Citing & Authors
