Open-source sentence-bert-base model - A practical tool for mapping Italian text to vector space

Sentence Bert Base

Developed by efederici

Italian sentence embedding model based on sentence-transformers, mapping text to a 768-dimensional vector space

Text Embedding

Transformers

Other

#Italian sentence embeddings #Semantic similarity calculation #Multilingual text matching

Downloads 409

Release Time: 3/27/2022

Model Overview

This model is optimized for Italian. It converts sentences and paragraphs into dense vector representations suitable for tasks such as semantic search, clustering, and similarity calculation.

Model Features

Italian language optimization

Trained on Italian STS data (stsb), so it is tuned for semantic similarity in Italian text

Efficient vectorization

Converts sentences and paragraphs into fixed-dimension (768-d) dense vector representations; input longer than the model's maximum sequence length of 512 tokens is truncated

Semantic similarity calculation

Generated vectors can be compared, for example with cosine similarity, to estimate the semantic similarity between sentences
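As a rough illustration of how two embeddings might be compared, here is a minimal cosine similarity sketch in plain Python. The function name `cosine_similarity` and the 3-dimensional toy vectors are assumptions made for this example; real embeddings from the model have 768 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


# Toy 3-dimensional stand-ins for the model's 768-dimensional embeddings.
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, so similarity is close to 1
v3 = [-3.0, 0.0, 1.0]  # orthogonal to v1, so similarity is close to 0

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

With real data, rows of the array returned by `model.encode(sentences)` could be passed in place of the toy vectors, since the function only iterates over the values.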

Model Capabilities

Text vectorization

Semantic similarity calculation

Text clustering

Semantic search

Use Cases

Information retrieval

Italian document retrieval

Building a semantic-based Italian search engine

Can return more relevant results than keyword search alone, since matching is based on meaning rather than exact terms
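A semantic search over precomputed embeddings can be sketched as ranking documents by cosine similarity to the query vector. The helper names `_normalize` and `semantic_search` are hypothetical, and the 2-dimensional vectors stand in for real document embeddings; this is one simple brute-force approach, not a description of any particular search engine.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def semantic_search(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, cosine_score) pairs, best match first."""
    q = _normalize(query_vec)
    scored = []
    for idx, doc in enumerate(doc_vecs):
        d = _normalize(doc)
        # Dot product of unit vectors equals their cosine similarity.
        score = sum(a * b for a, b in zip(q, d))
        scored.append((idx, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional vectors standing in for real document embeddings.
doc_vecs = [
    [1.0, 0.0],  # doc 0
    [0.0, 1.0],  # doc 1
    [1.0, 1.0],  # doc 2
]
query_vec = [2.0, 0.1]

results = semantic_search(query_vec, doc_vecs, top_k=2)
print(results)  # doc 0 ranks first, then doc 2
```

For large collections, a brute-force scan like this is usually replaced by an approximate nearest-neighbor index, but the ranking idea stays the same.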

Text analysis

Italian text clustering

Automatic grouping of Italian user reviews or feedback

Can discover underlying thematic patterns
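One simple way to group review embeddings is greedy threshold clustering: each vector joins the first cluster whose first member is similar enough, otherwise it starts a new cluster. This is a sketch of that specific technique, chosen for brevity, and not necessarily the method anyone uses with this model; the names `cosine` and `greedy_cluster`, the threshold of 0.8, and the toy vectors are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def greedy_cluster(vectors, threshold=0.8):
    """Assign each vector to the first cluster whose leader is similar enough.

    The leader of a cluster is its first member. Returns a list of clusters,
    each a list of indices into `vectors`.
    """
    clusters = []
    for idx, vec in enumerate(vectors):
        for cluster in clusters:
            leader = vectors[cluster[0]]
            if cosine(vec, leader) >= threshold:
                cluster.append(idx)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([idx])
    return clusters


# Toy 2-dimensional vectors standing in for embeddings of four reviews.
review_vecs = [
    [1.0, 0.1],  # review 0
    [0.9, 0.2],  # review 1, close to review 0
    [0.1, 1.0],  # review 2
    [0.2, 0.9],  # review 3, close to review 2
]

print(greedy_cluster(review_vecs, threshold=0.8))  # [[0, 1], [2, 3]]
```

The result depends on input order and on the threshold, so for real review data an established algorithm such as k-means or agglomerative clustering may be a better fit.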

🚀 sentence-bert-base

This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It can be used for tasks such as clustering or semantic search. It was trained on stsb.

🚀 Quick Start

📦 Installation

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
sentences = ["Questo è un esempio di frase", "Questo è un ulteriore esempio"]

model = SentenceTransformer('efederici/sentence-bert-base')
embeddings = model.encode(sentences)
print(embeddings)

Advanced Usage

Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["Questo è un esempio di frase", "Questo è un ulteriore esempio"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('efederici/sentence-bert-base')
model = AutoModel.from_pretrained('efederici/sentence-bert-base')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
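To make the pooling step concrete, the following plain-Python sketch reproduces the arithmetic of masked mean pooling for a single sentence. The function name `masked_mean_pool` and the toy token vectors are assumptions for illustration; the tensor version above does the same thing for a whole batch at once.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors where the mask is 1, ignoring padding (mask 0)."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                sums[i] += value
    # Mirrors the clamp with min=1e-9: avoids division by zero
    # when every position is padding.
    denom = max(count, 1e-9)
    return [s / denom for s in sums]


# Two real tokens followed by one padding token (2-dimensional for brevity).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

print(masked_mean_pool(tokens, mask))  # [2.0, 3.0]
```

The padding vector has no effect on the result, which is why the attention mask has to be applied before averaging.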

🔧 Technical Details

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
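The printout above describes two modules: a Transformer followed by mean Pooling. As a hedged sketch, the same pipeline could be assembled by hand with the `models` classes from sentence-transformers. The names `TRANSFORMER_CONFIG`, `POOLING_CONFIG`, `build_model`, and `active_pooling_modes` are hypothetical helpers introduced here, and `build_model` is only defined, not called, because calling it downloads the weights.

```python
# Settings copied from the architecture printout above.
TRANSFORMER_CONFIG = {"max_seq_length": 512, "do_lower_case": False}
POOLING_CONFIG = {
    "word_embedding_dimension": 768,
    "pooling_mode_cls_token": False,
    "pooling_mode_mean_tokens": True,
    "pooling_mode_max_tokens": False,
    "pooling_mode_mean_sqrt_len_tokens": False,
}


def build_model(model_name="efederici/sentence-bert-base"):
    """Assemble Transformer + Pooling by hand (downloads weights when called)."""
    # Imported lazily so the configuration can be inspected without the library.
    from sentence_transformers import SentenceTransformer, models

    transformer = models.Transformer(model_name, **TRANSFORMER_CONFIG)
    pooling = models.Pooling(**POOLING_CONFIG)
    return SentenceTransformer(modules=[transformer, pooling])


def active_pooling_modes(config):
    """List the pooling modes that are switched on in a Pooling config."""
    return [
        key
        for key, value in config.items()
        if key.startswith("pooling_mode_") and value is True
    ]


print(active_pooling_modes(POOLING_CONFIG))  # ['pooling_mode_mean_tokens']
```

In practice, `SentenceTransformer('efederici/sentence-bert-base')` loads this configuration automatically, so building the modules manually is mainly useful for inspecting or changing the pooling strategy.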

📚 Documentation

Citation

If you want to cite this model you can use this:

@misc {edoardo_federici_2022,
    author       = { {Edoardo Federici} },
    title        = { sentence-bert-base, sentence-transformer for Italian },
    year         = 2022,
    url          = { https://huggingface.co/efederici/sentence-bert-base },
    doi          = { 10.57967/hf/0112 },
    publisher    = { Hugging Face }
}

📄 License

No license information provided in the original document.
