S-BlueBERT: an open-source model for sentence and paragraph embeddings, used for clustering and semantic search

Developed by menadsa

This is a model based on sentence-transformers, capable of mapping sentences and paragraphs into a 768-dimensional dense vector space, suitable for tasks such as clustering and semantic search.

Text Embedding

Transformers

#Sentence Vectorization #Semantic Similarity #Text Clustering

Release date: 11/18/2022

Model Overview

This model produces vector representations of sentences and paragraphs: it converts text into 768-dimensional dense vectors that can be used for similarity calculation and semantic analysis.

Model Features

High-Dimensional Vector Representation

Maps sentences and paragraphs into a 768-dimensional dense vector space, facilitating semantic analysis and similarity calculation.

Easy to Use

The model can be easily loaded and used via the sentence-transformers library.

Versatile Applications

Suitable for various natural language processing tasks such as clustering and semantic search.

Model Capabilities

Sentence Vectorization

Semantic Similarity Calculation

Text Clustering

Semantic Search

Use Cases

Information Retrieval

Semantic Search

Use the model to embed queries and documents as vectors, then rank the documents by their similarity to the query.

Improves the relevance of search results.

Text Analysis

Text Clustering

Convert large amounts of text into vectors and perform clustering analysis.

Discovers latent themes or patterns in text data.
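
The clustering use case can be illustrated with a minimal sketch. It assumes the embeddings are already computed; short toy vectors stand in for the model's 768-dimensional output. The grouping rule is a simple greedy threshold ("leader") clustering chosen for brevity, not a method the model card prescribes, and the names `leader_clustering` and `threshold` are hypothetical. In practice a library implementation of k-means or agglomerative clustering would be the usual choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def leader_clustering(vectors, threshold=0.8):
    """Greedy clustering: each vector joins the first cluster whose leader
    (its first member) is at least `threshold` similar, else starts a new one."""
    leaders = []   # index of the first member of each cluster
    clusters = []  # member indices per cluster
    for i, vec in enumerate(vectors):
        for c, leader in enumerate(leaders):
            if cosine_similarity(vec, vectors[leader]) >= threshold:
                clusters[c].append(i)
                break
        else:
            # No existing leader was similar enough: open a new cluster.
            leaders.append(i)
            clusters.append([i])
    return clusters


# Toy vectors (assumption): real inputs would be sentence embeddings.
vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
]
clusters = leader_clustering(vectors, threshold=0.8)
print(clusters)
```

The result depends on input order and on the threshold, which is the main limitation of this greedy rule compared with k-means or agglomerative clustering.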

🚀 S-BlueBERT

This model is a sentence-transformers model. It maps sentences and paragraphs into a 768-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.

🚀 Quick Start

This section shows how to use the model, both with and without the sentence-transformers library.

✨ Features

Maps sentences and paragraphs to a 768-dimensional dense vector space.
Suitable for clustering and semantic search tasks.

📦 Installation

To use this model, first install sentence-transformers:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

Once sentence-transformers is installed, you can use the model as follows:

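from sentence_transformers import SentenceTransformer

# Sentences to embed
sentences = ["This is an example sentence", "Each sentence is converted"]

# '{MODEL_NAME}' is an unfilled template placeholder; substitute this model's Hugging Face Hub ID
model = SentenceTransformer('{MODEL_NAME}')
embeddings = model.encode(sentences)  # one 768-dimensional vector per sentence
print(embeddings)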
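
Once embeddings are available, semantic search reduces to ranking documents by their similarity to the query vector. The following is a minimal sketch under stated assumptions: toy 3-dimensional vectors stand in for the 768-dimensional output of `model.encode`, cosine similarity is used as the score, and the helper names `cosine_similarity` and `semantic_search` are hypothetical, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, score) pairs sorted by descending similarity."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption): real inputs would come from model.encode(...).
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query_vec = [0.9, 0.1, 0.0]

hits = semantic_search(query_vec, doc_vecs, top_k=2)
for index, score in hits:
    print(f"doc {index}: {score:.3f}")
```

For large collections a vectorized or approximate nearest-neighbor index would replace the Python loop; the ranking logic stays the same.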

Advanced Usage

You can also use the model without sentence-transformers. First, pass your input through the transformer model, then apply the appropriate pooling operation (here, mean pooling) to the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub ('{MODEL_NAME}' is an unfilled template placeholder; substitute the model's Hub ID)
tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
model = AutoModel.from_pretrained('{MODEL_NAME}')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
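
To see why the attention mask matters in `mean_pooling`, here is a plain-Python restatement of the same masked average for a single sentence, using toy numbers and no torch. The function name `masked_mean_pool` and the sample values are illustrative assumptions; the arithmetic mirrors the sum-then-divide-by-clamped-count logic above.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j in range(dim):
                totals[j] += vec[j]
    # Mirror the clamp in mean_pooling: avoid division by zero for an all-padding row.
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in totals]


# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
naive = [sum(col) / len(tokens) for col in zip(*tokens)]  # ignores the mask

print("masked mean:", pooled)
print("naive mean: ", naive)
```

The naive average is pulled toward the padding values, which is why batches padded to a common length need the mask for correct sentence embeddings.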

📚 Documentation

Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the following parameters:

DataLoader: __main__.PubmedLowMemoryLoader of length 26041 with parameters:

{'batch_size': 128}

Loss: sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:

{'scale': 20.0, 'similarity_fct': 'cos_sim'}
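
As a rough illustration of what these loss parameters mean, the sketch below computes a multiple-negatives ranking objective in plain Python. It assumes the usual formulation: for each (anchor, positive) pair in a batch, every other positive in the batch acts as a negative, and the loss is softmax cross-entropy over cosine similarities multiplied by `scale`. The function names and toy vectors are hypothetical; this is not the library's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the i-th positive is the correct match for the i-th anchor."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each anchor matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each anchor matches the other pair's positive

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

With `scale` at 20.0, a cosine gap of 1.0 between the right and wrong positive becomes a logit gap of 20, so well-separated pairs contribute almost no loss while mismatched pairs are penalized heavily.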

Parameters of the fit()-Method:

{
    "epochs": 1,
    "evaluation_steps": 2000,
    "evaluator": "__main__.PubmedTruePositiveIRetrievalEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 21,
    "weight_decay": 0.01
}
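
The interplay of `lr`, `warmup_steps`, and the `WarmupLinear` scheduler can be sketched as a small function. This assumes the standard rule of a linear ramp from zero to the base learning rate followed by a linear decay to zero, and that the total step count equals the DataLoader length times the number of epochs (26041 × 1, since `steps_per_epoch` is null). The function name `warmup_linear_lr` is hypothetical.

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=21, total_steps=26041):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup steps.
        factor = step / max(1, warmup_steps)
    else:
        # Decay linearly from base_lr to 0 over the remaining steps.
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor


for step in (0, 10, 21, 13031, 26041):
    print(step, warmup_linear_lr(step))
```

With only 21 warmup steps out of roughly 26,000, the warmup phase is very short, and training runs almost entirely in the linear-decay regime.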

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
