đ patent/sbert-all-MiniLM-L6-v2
This is a sentence-transformers model that maps sentences and paragraphs to a 384-dimensional dense vector space. It can be used for tasks such as clustering or semantic search.
đ Quick Start
đĻ Installation
Using this model is straightforward once sentence-transformers is installed. Install it with:
```bash
pip install -U sentence-transformers
```
đģ Usage Examples
Basic Usage
If you have sentence-transformers installed, you can use the model like this:
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model and encode the sentences into 384-dimensional embeddings
model = SentenceTransformer('patent/sbert-all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings)
```
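The model card mentions semantic search as a target use case. Below is a minimal sketch of scoring a query against a small corpus with the util module from sentence-transformers; the corpus sentences and the query are illustrative placeholders, not examples from the original card.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative corpus and query; replace with your own data
corpus = ["This is an example sentence", "Each sentence is converted",
          "Patent claims describe the scope of an invention"]
query = "How are sentences turned into vectors?"

model = SentenceTransformer('patent/sbert-all-MiniLM-L6-v2')
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus sentence
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
for sentence, score in zip(corpus, scores):
    print(f"{score:.4f}\t{sentence}")
```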
Advanced Usage
Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, using the attention mask
# so that padding tokens are ignored
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['This is an example sentence', 'Each sentence is converted']

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('patent/sbert-all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('patent/sbert-all-MiniLM-L6-v2')

# Tokenize the sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Apply mean pooling to get sentence embeddings
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
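Depending on your downstream task, you may want unit-length vectors. As an optional follow-on to the snippet above (not a required step from the original card), you can L2-normalize the pooled embeddings so that dot products become cosine similarities:

```python
import torch.nn.functional as F

# Optional: L2-normalize so a dot product equals cosine similarity
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

# Cosine similarity between the two example sentences
print(float(sentence_embeddings[0] @ sentence_embeddings[1]))
```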
đ Documentation
Evaluation Results
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
Training
The model was trained with the following parameters:
DataLoader:

`torch.utils.data.dataloader.DataLoader` of length 28316 with parameters:

```
{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

Loss:

`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`
Parameters of the fit()-Method:

```
{
    "epochs": 1,
    "evaluation_steps": 9999999,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 100,
    "weight_decay": 0.01
}
```
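For reference, here is a minimal sketch of how the parameters above map onto the classic sentence-transformers fit() API. The training pairs, the similarity labels, and the assumption that training started from the base sentence-transformers/all-MiniLM-L6-v2 checkpoint are all illustrative guesses; the actual training data is not part of this card.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder pairs with similarity labels in [0, 1]; the real training data is not published here
train_examples = [
    InputExample(texts=["A method for encoding text", "A technique for embedding sentences"], label=0.9),
    InputExample(texts=["A method for encoding text", "A chemical composition for paint"], label=0.1),
]

# Assumed base checkpoint (inferred from the model name, not stated in the card)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# Mirrors the fit() parameters listed above
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    scheduler='WarmupLinear',
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)
```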
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
```
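You can reproduce this printout and check the embedding dimension and maximum sequence length directly from a loaded model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('patent/sbert-all-MiniLM-L6-v2')

# Printing the model shows the Transformer + Pooling stack listed above
print(model)

print(model.get_sentence_embedding_dimension())  # 384
print(model.max_seq_length)                      # 512, per the architecture above
```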
Citing & Authors
đ License
No license information is provided for this model.
đ§ Technical Details
The model belongs to the sentence-transformers family and is used for sentence similarity and feature extraction tasks. It maps text to a 384-dimensional vector space, which is useful for applications such as clustering, semantic search, and retrieval. The training process uses the data loader, loss function, and optimization parameters described above, and the architecture consists of a Transformer layer followed by a mean-pooling layer.
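Since clustering is one of the use cases mentioned above, here is a minimal sketch using scikit-learn's KMeans (an external library, not part of this card); the patent-flavoured sentences are illustrative only.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Illustrative sentences; replace with your own corpus
sentences = [
    "A rotor blade for a wind turbine",
    "Wind turbine blade with a reinforced tip",
    "A pharmaceutical composition for treating asthma",
    "An inhaler formulation for respiratory disease",
]

model = SentenceTransformer('patent/sbert-all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

# Group the 384-dimensional embeddings into two clusters
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(embeddings)

for sentence, label in zip(sentences, labels):
    print(label, sentence)
```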