# ptbr-similarity-e5-small
This model is a fine-tuned version of `intfloat/multilingual-e5-small`, trained on the ASSIN2 dataset for sentence-similarity scoring. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.
## 🚀 Quick Start

Using this model is straightforward once `sentence-transformers` is installed. First, install the library:

```bash
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the fine-tuned model and encode the sentences into
# 384-dimensional embeddings.
model = SentenceTransformer('jmbrito/ptbr-similarity-e5-small')
embeddings = model.encode(sentences)
print(embeddings)
```
## ✨ Features

- **Sentence Similarity**: designed for sentence-similarity tasks, making it suitable for semantic search and clustering (see the sketch after this list).
- **Multilingual Support**: based on `intfloat/multilingual-e5-small`, it supports both Portuguese (`pt`) and English (`en`).
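As a quick illustration of the semantic-search use case, here is a minimal sketch using `util.semantic_search` from sentence-transformers. The corpus and query sentences are made-up placeholders; the English query against a Portuguese corpus simply relies on the multilingual base model:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('jmbrito/ptbr-similarity-e5-small')

# Toy corpus; these sentences are illustrative placeholders.
corpus = [
    "O gato dorme no sofá.",
    "A bolsa de valores caiu hoje.",
    "Ele preparou um jantar delicioso.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# An English query against the Portuguese corpus.
query_embedding = model.encode("The stock market dropped today.",
                               convert_to_tensor=True)

hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(f"{corpus[hit['corpus_id']]} (score: {hit['score']:.4f})")
```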
## 📦 Installation

To use this model, install the `sentence-transformers` library:

```bash
pip install -U sentence-transformers
```
## 💻 Usage Examples

### Basic Usage
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('jmbrito/ptbr-similarity-e5-small')
embeddings = model.encode(sentences)
print(embeddings)
```
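### Computing Similarity Scores

Because the model ends with a `Normalize` layer (see the architecture under Documentation), cosine similarity between two embeddings reduces to a dot product. A minimal sketch with an illustrative Portuguese pair:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('jmbrito/ptbr-similarity-e5-small')

# Illustrative sentence pair (not taken from ASSIN2).
emb1 = model.encode("Um homem está tocando violão.", convert_to_tensor=True)
emb2 = model.encode("Uma pessoa toca um instrumento musical.", convert_to_tensor=True)

# Cosine similarity in [-1, 1]; higher means more similar.
score = util.cos_sim(emb1, emb2)
print(f"Similarity: {score.item():.4f}")
```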
## 📚 Documentation

### Evaluation Results

This model was evaluated on the ASSIN2 test set by computing the Spearman and Pearson correlations between the predicted and gold similarity scores. The resulting Spearman correlation was 0.79934.
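The evaluation can be reproduced along these lines with `EmbeddingSimilarityEvaluator`. The sketch assumes the ASSIN2 test split is available on the Hugging Face Hub under the id `assin2` with `premise`, `hypothesis`, and `relatedness_score` fields (relatedness on a 1-5 scale, normalized here to [0, 1]):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('jmbrito/ptbr-similarity-e5-small')

# Assumed Hub id and field names for ASSIN2.
test = load_dataset("assin2", split="test")

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=test["premise"],
    sentences2=test["hypothesis"],
    scores=[s / 5.0 for s in test["relatedness_score"]],  # 1-5 -> [0, 1]
)
# Returns the correlation score(s); the exact return type (float vs. dict)
# depends on the sentence-transformers version.
print(evaluator(model))
```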
### Training

The model was trained with the following parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 204 with parameters:

```python
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
**Loss**:

`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`

**Parameters of the fit() method**:

```json
{
    "epochs": 10,
    "evaluation_steps": 100,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 100,
    "weight_decay": 0.01
}
```
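These settings map onto the legacy `SentenceTransformer.fit` API roughly as in the sketch below. This is not the exact training script; in particular, building `train_examples` from ASSIN2 pairs with labels normalized to [0, 1] is an assumption:

```python
import torch
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('intfloat/multilingual-e5-small')

# Assumed preprocessing: one InputExample per ASSIN2 training pair,
# with the 1-5 relatedness score rescaled to [0, 1].
train_examples = [
    InputExample(texts=["Primeira frase.", "Segunda frase."], label=0.8),
    # ... one example per training pair
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    evaluation_steps=100,
    scheduler='WarmupLinear',
    warmup_steps=100,
    optimizer_class=torch.optim.AdamW,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)
```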
### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
```
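The same stack can be assembled by hand from sentence-transformers building blocks, which makes the three layers explicit (a sketch; in practice you would simply load the published checkpoint):

```python
from sentence_transformers import SentenceTransformer, models

# (0) Transformer encoder
word_embedding_model = models.Transformer(
    'intfloat/multilingual-e5-small', max_seq_length=512
)
# (1) Mean pooling over token embeddings -> 384-dim sentence vector
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)
# (2) L2-normalize the sentence embedding
normalize = models.Normalize()

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, normalize])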
## 🔧 Technical Details

The model is a fine-tuned version of `intfloat/multilingual-e5-small`, trained and evaluated on the ASSIN2 dataset. Training uses the data loader, loss function, and optimization parameters described above. The architecture consists of a Transformer layer, a mean-pooling layer, and a normalization layer.
## 📄 License

This model is licensed under the MIT License.
## 📋 Model Information

| Property | Details |
|----------|---------|
| Model Type | Fine-tuned `intfloat/multilingual-e5-small` for sentence similarity |
| Training Data | ASSIN2 dataset |
| Metrics | Spearman correlation (spearmanr) |
| Library Name | sentence-transformers |
| Languages Supported | Portuguese (`pt`), English (`en`) |