Stsb Xlm R Multilingual Ro

Developed by BlackKakapo

A sentence embedding model fine-tuned for Romanian based on the stsb-xlm-r-multilingual model, capable of mapping text to a 768-dimensional vector space

Text Embedding

Transformers

Other #Romanian Semantic Similarity #Multilingual Sentence Embeddings #768-dimensional Dense Vectors

Downloads 803

Release Time: 10/7/2022

Model Overview

This is a sentence-transformers model specifically optimized for Romanian, capable of converting sentences and paragraphs into 768-dimensional dense vectors, suitable for tasks such as semantic search, clustering, and sentence similarity calculation.

Model Features

Romanian Language Optimization

Fine-tuned specifically on Romanian data, with the aim of representing Romanian semantics better than generic multilingual models

Efficient Semantic Encoding

Converts variable-length text into fixed 768-dimensional vectors, so that texts can be compared with inexpensive vector operations such as cosine similarity; input beyond the 128-token maximum sequence length is truncated

Multi-task Applicability

The generated embedding vectors can be used for various downstream tasks such as clustering, semantic search, and information retrieval

Model Capabilities

Sentence Vectorization

Semantic Similarity Calculation

Text Clustering Analysis

Cross-language Information Retrieval

Use Cases

Information Retrieval

Romanian Document Search

Enables intelligent search by calculating semantic similarity between query statements and document libraries

Delivers more relevant search results compared to keyword matching
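The ranking step behind this kind of search can be sketched in a few lines. The example below is a minimal illustration, not the model's own retrieval code: the helper names `cosine_similarity` and `rank_documents` are hypothetical, and the toy 3-dimensional lists stand in for the 768-dimensional vectors that `model.encode(...)` would return.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption): real inputs would be embeddings of a query
# and of each document in the library.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_documents(query, docs, top_k=2)
print(ranking)
```

For a large document library, the document embeddings would normally be computed once and stored, so that only the query needs to be encoded at search time.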

Content Analysis

User Feedback Clustering

Automatically groups and analyzes Romanian-language user reviews

Identifies similar feedback patterns and supports theme mining
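One simple way to group embeddings is a greedy threshold pass over cosine similarities. This is only a sketch under stated assumptions: the function `greedy_cluster`, the 0.8 threshold, and the 2-dimensional toy vectors are all illustrative choices, not something the model card prescribes. A library clustering algorithm may suit real review data better.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_cluster(vectors, threshold=0.8):
    """Group vector indices by similarity to each cluster's first member.

    A vector joins the first cluster whose seed (first member) has cosine
    similarity of at least `threshold` with it; otherwise it starts a new
    cluster. Returns a list of clusters, each a list of indices.
    """
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no existing cluster was close enough
    return clusters


# Toy vectors (assumption) standing in for embeddings of four reviews.
reviews = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.0, 0.9]]
groups = greedy_cluster(reviews, threshold=0.8)
print(groups)
```

The result depends on input order and on the threshold, so both are worth tuning against a sample of real feedback.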

🚀 stsb-xlm-r-multilingual-ro

This is a sentence-transformers model that maps sentences & paragraphs to a 768-dimensional dense vector space. It can be used for tasks such as clustering or semantic search. It is a fine-tuned version of stsb-xlm-r-multilingual for the Romanian language.

🚀 Quick Start

✨ Features

Maps sentences and paragraphs to a 768-dimensional dense vector space.
Suitable for tasks like clustering and semantic search.
Fine-tuned for the Romanian language.

📦 Installation

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('BlackKakapo/stsb-xlm-r-multilingual-ro')
embeddings = model.encode(sentences)
print(embeddings)

Advanced Usage

Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation (mean pooling for this model) on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    # The first element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

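# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('BlackKakapo/stsb-xlm-r-multilingual-ro')
model = AutoModel.from_pretrained('BlackKakapo/stsb-xlm-r-multilingual-ro')

# Tokenize sentences (pad to a common length, truncate overly long inputs)
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings without tracking gradients
with torch.no_grad():
    model_output = model(**encoded_input)

# Mean-pool the token embeddings into one vector per sentence
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print(sentence_embeddings)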

📚 Documentation

Training

Dataset: STS-ro. The text is in Romanian (ro). Scores range from 0 to 5, so I divide each score by 5, because EmbeddingSimilarityEvaluator (the evaluator used during fine-tuning) expects scores between 0 and 1. Dataset structure:

{
'score': 1.5,
 'sentence1': 'Un bărbat cântă la harpă.',
 'sentence2': 'Un bărbat cântă la claviatură.',
}
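The score normalization described above can be sketched as follows. The helper `normalize_example` is a hypothetical name introduced for illustration, and the range check is an added safeguard rather than part of the original training script.

```python
def normalize_example(example, max_score=5.0):
    """Return a copy of an STS record with its score scaled to [0, 1]."""
    score = example["score"]
    if not 0.0 <= score <= max_score:
        raise ValueError(f"score {score} is outside [0, {max_score}]")
    return {**example, "score": score / max_score}


# The sample record from the dataset structure shown in this card.
record = {
    "score": 1.5,
    "sentence1": "Un bărbat cântă la harpă.",
    "sentence2": "Un bărbat cântă la claviatură.",
}
normalized = normalize_example(record)
print(normalized["score"])
```

With sentence-transformers, each normalized record would typically be wrapped as `InputExample(texts=[sentence1, sentence2], label=score)` before being passed to the DataLoader.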

DataLoader: torch.utils.data.dataloader.DataLoader of length 223 with parameters:

{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss: sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss

Parameters of the fit()-Method:

{
    "epochs": 10,
    "evaluation_steps": 0,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 100,
    "weight_decay": 0.01
}
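To make these numbers concrete, the sketch below works out the total number of optimizer steps and the learning rate over time. It rests on two assumptions: with `steps_per_epoch` left as null, one epoch is one pass over the 223-batch DataLoader, and `WarmupLinear` means a linear ramp from zero followed by a linear decay back to zero. The function name `learning_rate` is illustrative.

```python
BATCHES_PER_EPOCH = 223  # DataLoader length reported in this card
EPOCHS = 10
WARMUP_STEPS = 100
BASE_LR = 2e-05

# Assumption: one optimizer step per batch, so 223 * 10 = 2230 steps in total.
TOTAL_STEPS = BATCHES_PER_EPOCH * EPOCHS


def learning_rate(step):
    """Learning rate at an optimizer step under linear warmup, then linear decay."""
    if step < WARMUP_STEPS:
        factor = step / WARMUP_STEPS
    else:
        factor = max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))
    return BASE_LR * factor


for step in (0, 50, 100, 1165, 2230):
    print(step, learning_rate(step))
```

Under these assumptions the warmup covers roughly the first 4.5% of training (100 of 2230 steps), and with a batch size of 32 the 223 batches correspond to at most 7136 training pairs per epoch.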

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
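The Pooling layer above has only `pooling_mode_mean_tokens` enabled, which means the sentence vector is the average of the token vectors at non-padding positions. A dependency-free toy version, with the hypothetical helper `mean_pool` and 2-dimensional vectors in place of 768-dimensional ones, might look like this:

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                total[j] += value
    count = max(count, 1)  # avoid dividing by zero for an all-padding input
    return [t / count for t in total]


# Two real tokens followed by one padding token (mask 0), which is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

Because padding is masked out, the pooled vector does not change when a sentence is padded to match a longer one in the same batch.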

Citing & Authors

BlackKakapo

🔧 Technical Details

Property	Details
Pipeline Tag	sentence-similarity
Tags	sentence-transformers, feature-extraction, sentence-similarity, transformers
Language	ro
Language Creators	machine-generated
Dataset	ro_sts
