# ptbr-similarity-e5-small
This model is a fine-tuned version of `intfloat/multilingual-e5-small`, trained on the ASSIN2 dataset for sentence-similarity scoring. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.
## 🚀 Quick Start

Using this model is straightforward once `sentence-transformers` is installed. First, install the library:

```bash
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the fine-tuned model and encode the sentences into
# 384-dimensional embeddings.
model = SentenceTransformer('jmbrito/ptbr-similarity-e5-small')
embeddings = model.encode(sentences)
print(embeddings)
```
## ✨ Features

- **Sentence Similarity**: designed for sentence-similarity tasks, making it suitable for semantic search and clustering (see the sketch after this list).
- **Multilingual Support**: based on `intfloat/multilingual-e5-small`, it supports both Portuguese (`pt`) and English (`en`).
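As a quick illustration of the semantic-search use case, here is a minimal sketch using `util.semantic_search` from sentence-transformers. The corpus and query sentences are made-up placeholders; the English query against a Portuguese corpus simply relies on the multilingual base model:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('jmbrito/ptbr-similarity-e5-small')

# Toy corpus; these sentences are illustrative placeholders.
corpus = [
    "O gato dorme no sofá.",
    "A bolsa de valores caiu hoje.",
    "Ele preparou um jantar delicioso.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# An English query against the Portuguese corpus.
query_embedding = model.encode("The stock market dropped today.",
                               convert_to_tensor=True)

hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(f"{corpus[hit['corpus_id']]} (score: {hit['score']:.4f})")
```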
## 📦 Installation

To use this model, install the `sentence-transformers` library:

```bash
pip install -U sentence-transformers
```
## 💻 Usage Examples

### Basic Usage
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('jmbrito/ptbr-similarity-e5-small')
embeddings = model.encode(sentences)
print(embeddings)
```
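### Computing Similarity Scores

Because the model ends with a `Normalize` layer (see the architecture under Documentation), cosine similarity between two embeddings reduces to a dot product. A minimal sketch with an illustrative Portuguese pair:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('jmbrito/ptbr-similarity-e5-small')

# Illustrative sentence pair (not taken from ASSIN2).
emb1 = model.encode("Um homem está tocando violão.", convert_to_tensor=True)
emb2 = model.encode("Uma pessoa toca um instrumento musical.", convert_to_tensor=True)

# Cosine similarity in [-1, 1]; higher means more similar.
score = util.cos_sim(emb1, emb2)
print(f"Similarity: {score.item():.4f}")
```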
## 📚 Documentation

### Evaluation Results

This model was evaluated on the ASSIN2 test set by computing the Spearman and Pearson correlations between the predicted and gold similarity scores. The resulting Spearman correlation was 0.79934.
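The evaluation can be reproduced along these lines with `EmbeddingSimilarityEvaluator`. The sketch assumes the ASSIN2 test split is available on the Hugging Face Hub under the id `assin2` with `premise`, `hypothesis`, and `relatedness_score` fields (relatedness on a 1-5 scale, normalized here to [0, 1]):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('jmbrito/ptbr-similarity-e5-small')

# Assumed Hub id and field names for ASSIN2.
test = load_dataset("assin2", split="test")

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=test["premise"],
    sentences2=test["hypothesis"],
    scores=[s / 5.0 for s in test["relatedness_score"]],  # 1-5 -> [0, 1]
)
# Returns the correlation score(s); the exact return type (float vs. dict)
# depends on the sentence-transformers version.
print(evaluator(model))
```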
### Training

The model was trained with the following parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 204 with parameters:

```python
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
**Loss**:

`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`

**Parameters of the fit() method**:

```json
{
    "epochs": 10,
    "evaluation_steps": 100,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 100,
    "weight_decay": 0.01
}
```
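These settings map onto the legacy `SentenceTransformer.fit` API roughly as in the sketch below. This is not the exact training script; in particular, building `train_examples` from ASSIN2 pairs with labels normalized to [0, 1] is an assumption:

```python
import torch
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('intfloat/multilingual-e5-small')

# Assumed preprocessing: one InputExample per ASSIN2 training pair,
# with the 1-5 relatedness score rescaled to [0, 1].
train_examples = [
    InputExample(texts=["Primeira frase.", "Segunda frase."], label=0.8),
    # ... one example per training pair
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    evaluation_steps=100,
    scheduler='WarmupLinear',
    warmup_steps=100,
    optimizer_class=torch.optim.AdamW,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)
```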
### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
```
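The same stack can be assembled by hand from sentence-transformers building blocks, which makes the three layers explicit (a sketch; in practice you would simply load the published checkpoint):

```python
from sentence_transformers import SentenceTransformer, models

# (0) Transformer encoder
word_embedding_model = models.Transformer(
    'intfloat/multilingual-e5-small', max_seq_length=512
)
# (1) Mean pooling over token embeddings -> 384-dim sentence vector
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)
# (2) L2-normalize the sentence embedding
normalize = models.Normalize()

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, normalize])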
## 🔧 Technical Details

The model is a fine-tuned version of `intfloat/multilingual-e5-small`, trained and evaluated on the ASSIN2 dataset. Training uses the data loader, loss function, and optimization parameters described above. The architecture consists of a Transformer layer, a mean-pooling layer, and a normalization layer.
## 📄 License

This model is licensed under the MIT License.
## 📋 Model Information

| Property | Details |
|----------|---------|
| Model Type | Fine-tuned `intfloat/multilingual-e5-small` for sentence similarity |
| Training Data | ASSIN2 dataset |
| Metrics | Spearman correlation (spearmanr) |
| Library Name | sentence-transformers |
| Languages Supported | Portuguese (`pt`), English (`en`) |