UNSEE CorInfoMax

Developed by asparius

This is a sentence embedding model based on sentence-transformers, capable of mapping sentences and paragraphs into a 768-dimensional dense vector space, suitable for tasks such as sentence similarity calculation and semantic search.

Release Time: 8/31/2023

Model Overview

This model is built using the sentence-transformers framework, primarily for generating vector representations of sentences to facilitate tasks such as sentence similarity calculation, clustering, or semantic search.

Model Features

High-dimensional Vector Representation

Can map sentences and paragraphs into a 768-dimensional dense vector space, capturing rich semantic information.

Sentence Similarity Calculation

Produces embeddings that can be compared directly, for example with cosine similarity, to score the semantic similarity between sentences.

Easy Integration

Can be easily integrated into existing systems through the sentence-transformers library.

Model Capabilities

Sentence Vectorization

Semantic Similarity Calculation

Text Feature Extraction

Semantic Search

Use Cases

Information Retrieval

Semantic Search

Sentence embeddings let a search engine match queries to documents by meaning rather than by exact keywords.

This can improve the relevance of search results.
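The ranking step behind semantic search can be sketched as follows. This is a minimal, dependency-free illustration, not the model's own code: the 3-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings, and `cosine_similarity` and `search` are hypothetical helpers defined here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def search(query_vec, corpus_vecs, top_k=2):
    """Return (index, score) pairs for the top_k most similar corpus vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

corpus = [
    "How do I reset my password?",
    "Opening hours of the library",
    "Steps to change account credentials",
]
# Toy vectors (assumption): in practice these would come from model.encode(corpus).
corpus_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
query_vec = [0.85, 0.2, 0.05]  # stand-in for the encoded query

for idx, score in search(query_vec, corpus_vecs):
    print(f"{score:.3f}  {corpus[idx]}")
```

For large corpora, a brute-force scan like this is usually replaced by an approximate nearest-neighbour index, but the scoring idea stays the same.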

Text Analysis

Document Clustering

Groups documents automatically by the similarity of their sentence embeddings.

This supports unsupervised document organization.
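One simple way to group documents by embedding similarity is sketched below. It is a deliberately naive greedy threshold scheme chosen for illustration, not an algorithm the model card prescribes; `greedy_cluster`, the 0.8 threshold, and the 2-dimensional toy vectors are all assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def greedy_cluster(vectors, threshold=0.8):
    """Put each vector in the first cluster whose seed (first member) is
    similar enough; otherwise start a new cluster with it as the seed."""
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine_similarity(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Toy document vectors (assumption): real ones would be 768-dimensional embeddings.
doc_vecs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
print(greedy_cluster(doc_vecs))  # [[0, 1, 4], [2, 3]]
```

The result of a greedy scheme depends on input order; standard algorithms such as k-means or agglomerative clustering are the usual choice when that matters.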

🚀 asparius/UNSEE-CorInfoMax

This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It can be used for tasks such as clustering or semantic search.

🚀 Quick Start

To get started, install sentence-transformers, load the model, and call encode() on a list of sentences. The Installation and Usage Examples sections below give the exact commands.

✨ Features

Vector Mapping: Maps sentences and paragraphs to a 768-dimensional dense vector space.
Versatile Applications: Suitable for tasks such as clustering and semantic search.

📦 Installation

The easiest way to use this model is through sentence-transformers. Install it with:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

If you have sentence-transformers installed, you can use the model as follows:

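from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from the Hugging Face Hub (downloaded on first use)
model = SentenceTransformer('asparius/corinfomax-72.31')

# encode() returns one 768-dimensional vector per input sentence
embeddings = model.encode(sentences)
print(embeddings)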
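A common next step is to turn the embeddings into similarity scores. The sketch below normalizes each vector to unit length so that a plain dot product equals cosine similarity. It uses only the standard library; the 4-dimensional vectors are made-up stand-ins for rows of `embeddings`, and `l2_normalize` and `dot` are helpers defined here rather than library functions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy stand-ins for model.encode(...) output (assumption: real rows have 768 values).
toy_embeddings = [
    [0.1, 0.3, -0.2, 0.7],
    [0.2, 0.25, -0.1, 0.6],
    [-0.5, 0.1, 0.9, -0.3],
]

unit = [l2_normalize(v) for v in toy_embeddings]

# For unit vectors, the dot product is the cosine similarity.
similarity = [[dot(a, b) for b in unit] for a in unit]
for row in similarity:
    print([round(s, 3) for s in row])
```

With sentence-transformers installed, `sentence_transformers.util.cos_sim(embeddings, embeddings)` computes the same kind of pairwise matrix on the real embeddings.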

Advanced Usage

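Without sentence-transformers, you can use the model through the Hugging Face transformers library: pass your input through the transformer model, then apply mean pooling to the contextualized token embeddings. Mean pooling is the pooling mode this model is configured with (see Full Model Architecture).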

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('asparius/corinfomax-72.31')
model = AutoModel.from_pretrained('asparius/corinfomax-72.31')

# Tokenize sentences (max_length=64 mirrors the max_seq_length listed under Full Model Architecture)
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=64, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
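To see why the attention mask matters in `mean_pooling`, the following dependency-free sketch redoes the same averaging on plain Python lists. The 2-dimensional token vectors and the `masked_mean` helper are illustrative assumptions; the logic mirrors the tensor version above, including the guard against division by zero.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, value in enumerate(vec):
                totals[i] += value
    count = max(sum(attention_mask), 1e-9)  # same role as torch.clamp(..., min=1e-9)
    return [t / count for t in totals]

# One sentence with 3 real tokens followed by 1 padding token (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

pooled = masked_mean(tokens, mask)
naive = [sum(col) / len(tokens) for col in zip(*tokens)]  # ignores the mask

print(pooled)  # [3.0, 4.0]
print(naive)   # [4.5, 5.25]
```

The naive average is pulled toward the padding vector, which is why batches padded to different lengths would otherwise give different embeddings for the same sentence.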

📚 Documentation

Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the following parameters:

DataLoader: torch.utils.data.dataloader.DataLoader of length 31250 with parameters:

{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss: sentence_transformers.losses.CorInfoMax.CorInfoMaxLoss

Parameters of the fit()-Method:

{
    "epochs": 1,
    "evaluation_steps": 3125,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 0.0001
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 3125,
    "weight_decay": 0.01
}
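The numbers above imply a concrete training budget and learning-rate curve, sketched below. Two assumptions are made: that a null steps_per_epoch means one step per DataLoader batch, and that WarmupLinear ramps the learning rate linearly from 0 to its peak and then decays it linearly to 0 at the final step. `warmup_linear_lr` is a hypothetical helper written for this illustration, not the library's scheduler.

```python
BATCHES_PER_EPOCH = 31250  # DataLoader length reported above
BATCH_SIZE = 32
EPOCHS = 1
WARMUP_STEPS = 3125
EVALUATION_STEPS = 3125
PEAK_LR = 1e-4

total_steps = BATCHES_PER_EPOCH * EPOCHS

def warmup_linear_lr(step):
    """Assumed schedule: linear warmup to PEAK_LR, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - WARMUP_STEPS)

# Upper bound on examples seen: the last batch may hold fewer than 32 items.
print("examples seen (at most):", BATCHES_PER_EPOCH * BATCH_SIZE)
print("warmup share of training:", WARMUP_STEPS / total_steps)
print("evaluations during training:", total_steps // EVALUATION_STEPS)

for step in (0, WARMUP_STEPS, total_steps // 2, total_steps):
    print(step, warmup_linear_lr(step))
```

Under these assumptions the model sees at most one million training examples, spends the first 10% of its steps warming up, and is evaluated ten times.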

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 64, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)

🔧 Technical Details

The model uses the sentence-transformers framework to map text to a 768-dimensional vector space. A BERT encoder with a maximum sequence length of 64 tokens produces contextualized token embeddings, and mean pooling aggregates them into a single sentence embedding. Training ran for one epoch with CorInfoMaxLoss, the AdamW optimizer (learning rate 0.0001, weight decay 0.01), and a WarmupLinear schedule with 3125 warmup steps.

📄 License

No license information is provided in the original document.
