Serafim-100m Open-source Portuguese Sentence Encoder - For Semantic Search and Text Clustering

Developed by PORTULAN

A Portuguese sentence encoder built on sentence-transformers. It maps text to a 768-dimensional vector space and is suited to tasks such as semantic search and text clustering.

Text Embedding

Transformers

Open Source License: MIT

Tags: Portuguese sentence vectors, Semantic search optimization, 768-dimensional embeddings

Downloads 4,040

Release Time: 7/4/2024

Model Overview

This model is designed specifically for Portuguese (PT). It converts sentences and paragraphs into high-dimensional vector representations that can be used for semantic similarity calculation and information retrieval.

Model Features

Portuguese optimization

Optimized for Portuguese text to better capture its semantic features.

High-dimensional vector representation

Maps text to a 768-dimensional dense vector space, facilitating semantic similarity calculations.

Sentence-level encoding

Processes sentence-level and paragraph-level text, generating meaningful vector representations.

Model Capabilities

Text vectorization

Semantic similarity calculation

Information retrieval

Text clustering

Use Cases

Information retrieval

Document search

Build a semantic document search system

Improve the semantic relevance of search results

Text analysis

Text clustering

Automatically group semantically similar documents or sentences

Discover latent themes in text data

🚀 Serafim 100m Portuguese (PT) Sentence Encoder

This model, based on sentence-transformers, maps sentences and paragraphs to a 768-dimensional dense vector space. It can be used for tasks such as clustering and semantic search.

🚀 Quick Start

📦 Installation

You can install the sentence-transformers library using the following command:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('PORTULAN/serafim-100m-portuguese-pt-sentence-encoder-ir')
embeddings = model.encode(sentences)
print(embeddings)
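
For semantic search, the embeddings are typically compared by cosine similarity. The following is a minimal sketch of that ranking step in pure Python, using toy 3-dimensional vectors in place of the model's 768-dimensional output. The function names `cosine_similarity` and `rank_by_similarity` are hypothetical and not part of the model or of sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (assumption for illustration).
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]

ranking = rank_by_similarity(query, docs)
for index, score in ranking:
    print(index, round(score, 4))
```

With the real model, `query` would come from `model.encode(...)` on the search text and `docs` from `model.encode(...)` on the document collection. The sentence-transformers library also ships a `util.cos_sim` helper that performs the same comparison on tensors.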

Advanced Usage

Without the sentence-transformers library, you can use the model as follows. First, pass your input through the transformer model, and then apply the appropriate pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('PORTULAN/serafim-100m-portuguese-pt-sentence-encoder-ir')
model = AutoModel.from_pretrained('PORTULAN/serafim-100m-portuguese-pt-sentence-encoder-ir')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
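
To make the role of the attention mask concrete, here is a small pure-Python sketch of the same masked average for a single sequence. It uses toy 2-dimensional token vectors, and the helper name `masked_mean` is hypothetical. It shows how a padding position would distort the result if it were averaged in.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                totals[j] += vec[j]
    count = max(count, 1e-9)  # guard against division by zero, like torch.clamp in mean_pooling
    return [t / count for t in totals]


# Two real tokens followed by one padding position (toy values, for illustration only).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
naive = [sum(v[j] for v in tokens) / len(tokens) for j in range(2)]

print(pooled)  # [2.0, 3.0]
print(naive)   # much larger, because the padding vector was averaged in
```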

📚 Documentation

Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the following parameters:

DataLoader: sentence_transformers.datasets.NoDuplicatesDataLoader.NoDuplicatesDataLoader of length 361643 with parameters:

{'batch_size': 220}

Loss: sentence_transformers.losses.GISTEmbedLoss.GISTEmbedLoss with parameters:

{'guide': SentenceTransformer(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
), 'temperature': 0.01}
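
The idea behind a guided contrastive loss of this kind can be sketched as follows: a guide model scores every anchor/candidate pair in the batch, and any in-batch negative that the guide rates as more similar to the anchor than the true positive is dropped before a temperature-scaled cross-entropy is computed. The code below is a simplified pure-Python illustration of that idea, not the library's implementation. It only considers anchor-to-positive similarities, and the function name `guided_contrastive_loss` and the toy similarity matrices are assumptions made for the example.

```python
import math


def guided_contrastive_loss(student_sim, guide_sim, temperature=0.01):
    """Simplified guided in-batch contrastive loss.

    student_sim[i][j]: similarity of anchor i and candidate j from the model being trained.
    guide_sim[i][j]:   the same pair scored by the guide model.
    Candidate i is the true positive for anchor i.
    """
    n = len(student_sim)
    total = 0.0
    for i in range(n):
        threshold = guide_sim[i][i]  # guide's score for the true pair
        logits = []
        for j in range(n):
            if j != i and guide_sim[i][j] > threshold:
                continue  # likely false negative according to the guide: skip it
            logits.append(student_sim[i][j] / temperature)
        # Numerically stable log-sum-exp, then cross-entropy against the positive.
        max_logit = max(logits)
        log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum - student_sim[i][i] / temperature
    return total / n


# Toy batch of two pairs (values invented for illustration).
student = [[0.90, 0.95], [0.20, 0.70]]
guide_filtering = [[0.80, 0.95], [0.10, 0.90]]   # guide flags candidate 1 for anchor 0
guide_permissive = [[1.00, 0.00], [0.00, 1.00]]  # guide flags nothing

loss_filtered = guided_contrastive_loss(student, guide_filtering)
loss_unfiltered = guided_contrastive_loss(student, guide_permissive)
print(loss_filtered, loss_unfiltered)
```

In this toy batch the filtered loss is lower because the candidate that the guide considers a probable false negative no longer penalizes the model. The low temperature of 0.01 makes the softmax very sharp, so small differences in similarity translate into large differences in loss.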

Parameters of the fit()-Method:

{
    "epochs": 1,
    "evaluation_steps": 1809,
    "evaluator": "sentence_transformers.evaluation.InformationRetrievalEvaluator.InformationRetrievalEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 1e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": 361643,
    "warmup_steps": 36165,
    "weight_decay": 0.01
}
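
The warmup settings above can be read as a learning-rate curve. The sketch below assumes that `WarmupLinear` means a linear ramp from 0 to the base rate over `warmup_steps`, followed by a linear decay to 0 at the final step, and that the total number of steps is `epochs * steps_per_epoch`. The function name `warmup_linear_lr` is hypothetical.

```python
def warmup_linear_lr(step, base_lr=1e-05, warmup_steps=36165, total_steps=361643):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# Sample the curve at the start, mid-warmup, the peak, the midpoint, and the end.
for step in (0, 18000, 36165, 180000, 361643):
    print(step, warmup_linear_lr(step))
```

Under these assumptions the warmup covers roughly the first 10% of training (36165 of 361643 steps), and the rate peaks at 1e-05 exactly when warmup ends.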

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

Citing & Authors

The article was presented at the EPIA 2024 conference and published by Springer:

@InProceedings{epia2024serafim,
    title={Open Sentence Embeddings for Portuguese with the Serafim PT* encoders family}, 
    author={Luís Gomes and António Branco and João Silva and João Rodrigues and Rodrigo Santos},
    editor={Manuel Filipe Santos and José Machado and Paulo Novais and Paulo Cortez and Pedro Miguel Moreira},
    booktitle={Progress in Artificial Intelligence},
    doi={10.1007/978-3-031-73503-5_22},
    year={2024},
    publisher={Springer Nature Switzerland},
    address={Cham},
    pages={267--279},
    isbn={978-3-031-73503-5}
}

Before publication by Springer, the preprint was available on arXiv:

@misc{gomes2024opensentenceembeddingsportuguese,
    title={Open Sentence Embeddings for Portuguese with the Serafim PT* encoders family}, 
    author={Luís Gomes and António Branco and João Silva and João Rodrigues and Rodrigo Santos},
    year={2024},
    eprint={2407.19527},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2407.19527}, 
}

📄 License

This project is licensed under the MIT license.
