mmarco-sentence-flare-it: Open Source Model for Italian Semantic Search and Sentence Similarity

Developed by nickprock

This is an Italian sentence embedding model based on sentence-transformers, capable of mapping sentences and paragraphs into a 384-dimensional dense vector space, suitable for tasks such as semantic search and sentence similarity calculation.

Text Embedding

Transformers

Open Source License: Apache-2.0

Tags: #Italian sentence similarity #384-dimensional vector embedding #Multilingual information retrieval

Downloads 26

Release Time: 9/28/2023

Model Overview

This model is tuned for Italian. It produces sentence embeddings for natural language processing tasks such as information retrieval, clustering, and semantic similarity calculation.

Model Features

Italian optimization

Trained for Italian text on the unicamp-dl/mmarco dataset

384-dimensional dense vectors

Capable of mapping sentences and paragraphs into a 384-dimensional dense vector space

Semantic search capability

Suitable for building semantic search engines and information retrieval systems

Model Capabilities

Sentence embedding generation

Semantic similarity calculation

Information retrieval

Text clustering

Use Cases

Information retrieval

Document search

Document retrieval system based on semantic similarity

Can effectively match queries with relevant documents

Text analysis

Sentence similarity calculation

Calculate the semantic similarity between two Italian sentences

Can be used in applications such as Q&A systems and duplicate detection
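
As a rough illustration of the sentence similarity use case, the sketch below compares embedding vectors with cosine similarity and flags pairs above a cutoff. The short toy vectors stand in for real 384-dimensional embeddings, and the `cosine_similarity` and `is_duplicate` helpers and the 0.9 threshold are hypothetical choices made for this example, not part of the model or of sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def is_duplicate(a, b, threshold=0.9):
    """Hypothetical rule: treat two sentences as duplicates above the threshold."""
    return cosine_similarity(a, b) >= threshold


# Toy 4-dimensional vectors standing in for model.encode(...) outputs.
emb_a = [1.0, 2.0, 0.0, 1.0]
emb_b = [2.0, 4.0, 0.0, 2.0]  # same direction as emb_a
emb_c = [0.0, 0.0, 3.0, 0.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))
print(cosine_similarity(emb_a, emb_c))
print(is_duplicate(emb_a, emb_b), is_duplicate(emb_a, emb_c))
```

In practice the threshold would need to be tuned on labelled Italian sentence pairs; the value used here is only a placeholder.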

🚀 mmarco-sentence-flare-it

This is a sentence-transformers model that maps sentences and paragraphs to a 384-dimensional dense vector space. It can be used for tasks such as clustering or semantic search.

🚀 Quick Start

This model can be used easily after installing sentence-transformers.

✨ Features

Maps sentences and paragraphs to a 384-dimensional dense vector space.
Suitable for tasks like clustering or semantic search.

📦 Installation

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer, util

query = "Quante persone vivono a Londra?"
docs = ["A Londra vivono circa 9 milioni di persone", "Londra è conosciuta per il suo quartiere finanziario"]

# Load the model
model = SentenceTransformer('nickprock/mmarco-sentence-flare-it')

# Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)

Advanced Usage

Without sentence-transformers, you can still use the model through the transformers library. First pass the input through the transformer model, then apply the correct pooling operation (mean pooling, for this model) on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    return embeddings


# Sentences we want sentence embeddings for
query = "Quante persone vivono a Londra?"
docs = ["A Londra vivono circa 9 milioni di persone", "Londra è conosciuta per il suo quartiere finanziario"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("nickprock/mmarco-sentence-flare-it")
model = AutoModel.from_pretrained("nickprock/mmarco-sentence-flare-it")

# Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

# Compute dot score between query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
print("Query:", query)
for doc, score in doc_score_pairs:
    print(score, doc)
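
The mean pooling step above can be checked by hand on a tiny input. The following is a dependency-free sketch of the same arithmetic for a single sentence, written with plain lists instead of tensors; the `masked_mean_pooling` name and the toy values are illustrative only.

```python
def masked_mean_pooling(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padded positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                summed[i] += value
    # Same guard as torch.clamp(..., min=1e-9): avoid division by zero.
    count = max(sum(attention_mask), 1e-9)
    return [value / count for value in summed]


# Three token positions; the last one is padding (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pooling(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padded position contributes nothing to the result, which is why the attention mask has to be applied before averaging.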

📚 Documentation

Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the following parameters:

DataLoader: torch.utils.data.dataloader.DataLoader of length 7500 with parameters:

{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss: sentence_transformers.losses.TripletLoss.TripletLoss with parameters:

{'distance_metric': 'TripletDistanceMetric.EUCLIDEAN', 'triplet_margin': 5}
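
For intuition, these loss parameters correspond to the usual triplet objective, max(d(anchor, positive) - d(anchor, negative) + margin, 0), with d the Euclidean distance and a margin of 5. The snippet below is a minimal pure-Python sketch of that formula on toy 2-dimensional vectors. It is not the sentence-transformers implementation, and the helper names are illustrative.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=5.0):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    d_pos = euclidean_distance(anchor, positive)
    d_neg = euclidean_distance(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


anchor = [0.0, 0.0]
positive = [3.0, 4.0]        # distance 5 from the anchor
far_negative = [12.0, 16.0]  # distance 20: well beyond the margin
close_negative = [3.0, 0.0]  # distance 3: closer than the positive

print(triplet_loss(anchor, positive, far_negative))    # 0.0
print(triplet_loss(anchor, positive, close_negative))  # 7.0
```

The loss is zero once the negative is at least `margin` further from the anchor than the positive, so only triplets that violate the margin contribute to the gradient.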

Parameters of the fit() method:

{
    "epochs": 10,
    "evaluation_steps": 500,
    "evaluator": "sentence_transformers.evaluation.TripletEvaluator.TripletEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": 1500,
    "warmup_steps": 7500,
    "weight_decay": 0.01
}
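
To make the schedule concrete, the sketch below computes the learning rate at a few steps. It assumes that WarmupLinear ramps the rate linearly from 0 to `lr` over `warmup_steps` and then decays it linearly to 0, and that the total number of steps is epochs × steps_per_epoch = 15000. Both points are assumptions about the training setup rather than facts stated above, and `warmup_linear_lr` is a hypothetical helper.

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=7500, total_steps=15000):
    """Learning rate at an optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# total_steps assumes epochs * steps_per_epoch = 10 * 1500.
for step in (0, 3750, 7500, 11250, 15000):
    print(step, warmup_linear_lr(step))
```

Under these assumptions the warmup covers the first half of training, and the peak rate of 2e-05 is reached only at step 7500.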

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

Citing & Authors

More information about the base model here

📄 License

This model is licensed under the Apache-2.0 license.

Property	Details
Pipeline Tag	sentence-similarity
Tags	sentence-transformers, feature-extraction, sentence-similarity, transformers, mteb
License	apache-2.0
Datasets	unicamp-dl/mmarco
Language	it
Library Name	sentence-transformers
Model Name	mmarco-sentence-flare-it
Results - Task 1 (Classification - mteb/amazon_massive_intent)	Accuracy: 22.299932750504368, F1: 20.147804322480262
Results - Task 2 (Classification - mteb/amazon_massive_scenario)	Accuracy: 27.40753194351042, F1: 25.187141587127705
Results - Task 3 (STS - mteb/sts22-crosslingual-sts)	cos_sim_pearson: 30.67175493186678, cos_sim_spearman: 37.92638638971281, euclidean_pearson: 37.47072224334179, euclidean_spearman: 39.23036609148336, manhattan_pearson: 42.92657347688227, manhattan_spearman: 43.93955531904934
