🚀 sentence-IT5-base
This is a sentence-transformers model: it maps sentences and paragraphs to a 512-dimensional dense vector space and can be used for tasks such as clustering or semantic search. It is based on the IT5 (Italian T5) base model and was trained on several datasets.
🚀 Quick Start
✨ Features
- Maps sentences and paragraphs to a 512-dimensional dense vector space.
- Can be used for clustering or semantic search.
- Based on the T5 (IT5) base model.
- Trained on multiple datasets, including question/context pairs (squad-it), tag/news-article pairs, headline/text pairs (change-it), and stsb.
📦 Installation
Using this model is straightforward once you have sentence-transformers installed:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
If you have sentence-transformers installed, you can use the model like this:
```python
from sentence_transformers import SentenceTransformer

sentences = ["Questo è un esempio di frase", "Questo è un ulteriore esempio"]

model = SentenceTransformer('efederici/sentence-IT5-base')
embeddings = model.encode(sentences)
print(embeddings)
```
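Since the embeddings share one vector space, semantic search reduces to a nearest-neighbor lookup over cosine similarity. Below is a minimal sketch using `sentence_transformers.util.cos_sim`; the corpus and query are made-up placeholder sentences, not from the training data:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('efederici/sentence-IT5-base')

# Illustrative corpus and query (placeholder Italian sentences)
corpus = [
    "Il gatto dorme sul divano",
    "La borsa di Milano chiude in rialzo",
    "Ricetta per la pasta alla carbonara",
]
query = "Come si cucina la carbonara?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus sentence
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = scores.argmax().item()
print(corpus[best], scores[best].item())
```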
Advanced Usage
Without sentence-transformers, you can use the model by passing your input through the transformer model and then applying the right pooling operation (mean pooling, in this case) on top of the contextualized word embeddings:
```python
import torch
from transformers import AutoTokenizer, T5EncoderModel


def mean_pooling(model_output, attention_mask):
    """Mean-pool token embeddings, using the attention mask to ignore padding tokens."""
    token_embeddings = model_output[0]  # all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


sentences = ["Questo è un esempio di frase", "Questo è un ulteriore esempio"]

tokenizer = AutoTokenizer.from_pretrained('efederici/sentence-IT5-base')
# The checkpoint is used as a T5 encoder (see Technical Details), so load it with T5EncoderModel
model = T5EncoderModel.from_pretrained('efederici/sentence-IT5-base')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Apply mean pooling to get fixed-size sentence embeddings
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
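As a sanity check, the manually pooled vectors should agree with what sentence-transformers returns, up to floating-point noise. A minimal sketch, assuming both snippets above have run in the same session (`sentences` and `sentence_embeddings` are reused from them):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('efederici/sentence-IT5-base')
st_embeddings = st_model.encode(sentences)

# Both paths should produce (approximately) the same 512-dimensional vectors
print(st_embeddings.shape)  # (2, 512)
print(np.allclose(st_embeddings, sentence_embeddings.numpy(), atol=1e-4))
```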
🔧 Technical Details
The full model architecture is as follows:
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': None, 'do_lower_case': False}) with Transformer model: T5EncoderModel
  (1): Pooling({'word_embedding_dimension': 512, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
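For reference, the same two-module stack (T5 encoder followed by mean pooling) can be assembled by hand with `sentence_transformers.models`. This is only an illustrative sketch; loading the published checkpoint already gives you this configuration:

```python
from sentence_transformers import SentenceTransformer, models

# Module 0: the T5 encoder as a word-embedding model
word_embedding_model = models.Transformer('efederici/sentence-IT5-base', max_seq_length=None)

# Module 1: mean pooling over the 512-dimensional token embeddings
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```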
| Property | Details |
|----------|---------|
| Model Type | sentence-transformers |
| Training Data | Dataset built from question/context pairs (squad-it), tag/news-article pairs, headline/text pairs (change-it), and stsb |