🚀 rubert-mini-uncased
This model computes sentence embeddings for Russian and English. It is obtained by distilling the embeddings of ai-forever/FRIDA (embedding size 1536, 24 layers). FRIDA's main usage mode, CLS pooling, is replaced with mean pooling; no other changes to the model's behavior are made (such as modifying or filtering the embeddings, or using an additional model). The distillation covers as much of FRIDA as possible: the embeddings of Russian and English sentences as well as the behavior of the prefixes.
The model is uncased: it does not distinguish between uppercase and lowercase letters when processing text. For example, the phrases "С Новым Годом!" and "С НОВЫМ ГОДОМ!" ("Happy New Year!") are encoded with the same token sequence and produce identical embeddings. The model has an embedding size of 384 and 7 layers. Its context size is the same as FRIDA's: 512 tokens.
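A minimal check of the uncased behavior, using the sentence-transformers interface shown in the usage examples below; since both strings map to the same token sequence, their embeddings should coincide:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sergeyzh/rubert-mini-uncased")

# Differently cased versions of the same phrase ("Happy New Year!").
embeddings = model.encode(["С Новым Годом!", "С НОВЫМ ГОДОМ!"])
print(np.allclose(embeddings[0], embeddings[1]))  # expected: True
```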
✨ Features
Prefixes
All prefixes are inherited from FRIDA.
The prefixes used and their influence on the model's scores on the encodechka benchmark (a short usage sketch follows the task list below):
| Prefix | STS | PI | NLI | SA | TI |
|--------|-----|----|-----|----|----|
| - | 0.817 | 0.734 | 0.448 | 0.799 | 0.971 |
| search_query: | 0.828 | 0.752 | 0.463 | 0.794 | 0.973 |
| search_document: | 0.794 | 0.730 | 0.446 | 0.797 | 0.971 |
| paraphrase: | 0.823 | 0.760 | 0.446 | 0.802 | 0.973 |
| categorize: | 0.820 | 0.753 | 0.482 | 0.805 | 0.972 |
| categorize_sentiment: | 0.604 | 0.595 | 0.431 | 0.798 | 0.955 |
| categorize_topic: | 0.711 | 0.485 | 0.391 | 0.750 | 0.962 |
| categorize_entailment: | 0.805 | 0.750 | 0.525 | 0.800 | 0.969 |
Tasks:
- Semantic text similarity (STS);
- Paraphrase identification (PI);
- Natural language inference (NLI);
- Sentiment analysis (SA);
- Toxicity identification (TI).
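A prefix is plain text prepended to the sentence before tokenization; in sentence-transformers the same effect is obtained with the `prompt` argument. The minimal sketch below assumes the model's pooling configuration includes prompt tokens, which is the sentence-transformers default:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sergeyzh/rubert-mini-uncased")

text = "Сколько программистов нужно, чтобы вкрутить лампочку?"

# prompt="search_query: " prepends the prefix to the text before tokenization,
# so it should match manual prepending (assuming prompt tokens are pooled).
with_prompt = model.encode(text, prompt="search_query: ")
prepended = model.encode("search_query: " + text)
print(np.allclose(with_prompt, prepended))  # expected: True
```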
Metrics
The model's scores on the ruMTEB benchmark:
| Task | Metric | Frida | rubert-mini-uncased | rubert-mini-frida | multilingual-e5-large-instruct | multilingual-e5-large |
|------|--------|-------|---------------------|-------------------|--------------------------------|-----------------------|
| CEDRClassification | Accuracy | 0.646 | 0.586 | 0.552 | 0.500 | 0.448 |
| GeoreviewClassification | Accuracy | 0.577 | 0.485 | 0.464 | 0.559 | 0.497 |
| GeoreviewClusteringP2P | V-measure | 0.783 | 0.683 | 0.698 | 0.743 | 0.605 |
| HeadlineClassification | Accuracy | 0.890 | 0.884 | 0.882 | 0.862 | 0.758 |
| InappropriatenessClassification | Accuracy | 0.783 | 0.705 | 0.698 | 0.655 | 0.616 |
| KinopoiskClassification | Accuracy | 0.705 | 0.607 | 0.595 | 0.661 | 0.566 |
| RiaNewsRetrieval | NDCG@10 | 0.868 | 0.791 | 0.721 | 0.824 | 0.807 |
| RuBQReranking | MAP@10 | 0.771 | 0.713 | 0.711 | 0.717 | 0.756 |
| RuBQRetrieval | NDCG@10 | 0.724 | 0.640 | 0.654 | 0.692 | 0.741 |
| RuReviewsClassification | Accuracy | 0.751 | 0.684 | 0.658 | 0.686 | 0.653 |
| RuSTSBenchmarkSTS | Pearson correlation | 0.814 | 0.795 | 0.803 | 0.840 | 0.831 |
| RuSciBenchGRNTIClassification | Accuracy | 0.699 | 0.653 | 0.625 | 0.651 | 0.582 |
| RuSciBenchGRNTIClusteringP2P | V-measure | 0.670 | 0.618 | 0.586 | 0.622 | 0.520 |
| RuSciBenchOECDClassification | Accuracy | 0.546 | 0.509 | 0.491 | 0.502 | 0.445 |
| RuSciBenchOECDClusteringP2P | V-measure | 0.566 | 0.525 | 0.507 | 0.528 | 0.450 |
| SensitiveTopicsClassification | Accuracy | 0.398 | 0.365 | 0.373 | 0.323 | 0.257 |
| TERRaClassification | Average Precision | 0.665 | 0.604 | 0.604 | 0.639 | 0.584 |
Average scores by task type:

| Task type | Metric | Frida | rubert-mini-uncased | rubert-mini-frida | multilingual-e5-large-instruct | multilingual-e5-large |
|-----------|--------|-------|---------------------|-------------------|--------------------------------|-----------------------|
| Classification | Accuracy | 0.707 | 0.657 | 0.631 | 0.654 | 0.588 |
| Clustering | V-measure | 0.673 | 0.608 | 0.597 | 0.631 | 0.525 |
| MultiLabelClassification | Accuracy | 0.522 | 0.476 | 0.463 | 0.412 | 0.353 |
| PairClassification | Average Precision | 0.665 | 0.604 | 0.604 | 0.639 | 0.584 |
| Reranking | MAP@10 | 0.771 | 0.713 | 0.711 | 0.717 | 0.756 |
| Retrieval | NDCG@10 | 0.796 | 0.715 | 0.687 | 0.758 | 0.774 |
| STS | Pearson correlation | 0.814 | 0.795 | 0.803 | 0.840 | 0.831 |
| Average | Average | 0.707 | 0.653 | 0.642 | 0.664 | 0.630 |
💻 Usage Examples
Basic Usage
Using with the transformers library:
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def pool(hidden_state, mask, pooling_method="mean"):
    # Mean pooling over non-padding tokens (the mode this model was distilled for).
    if pooling_method == "mean":
        s = torch.sum(hidden_state * mask.unsqueeze(-1).float(), dim=1)
        d = mask.sum(dim=1, keepdim=True).float()
        return s / d
    elif pooling_method == "cls":
        return hidden_state[:, 0]


# Prefixed inputs: the first three texts are compared with the last three.
inputs = [
    "paraphrase: В Ярославской области разрешили работу бань, но без посетителей",
    "categorize_entailment: Женщину доставили в больницу, за ее жизнь сейчас борются врачи.",
    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",
    "paraphrase: Ярославским баням разрешили работать без посетителей",
    "categorize_entailment: Женщину спасают врачи.",
    "search_document: Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование."
]

tokenizer = AutoTokenizer.from_pretrained("sergeyzh/rubert-mini-uncased")
model = AutoModel.from_pretrained("sergeyzh/rubert-mini-uncased")

tokenized_inputs = tokenizer(inputs, max_length=512, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**tokenized_inputs)

embeddings = pool(
    outputs.last_hidden_state,
    tokenized_inputs["attention_mask"],
    pooling_method="mean"
)

embeddings = F.normalize(embeddings, p=2, dim=1)

# Cosine similarity of each matched pair (paraphrase, entailment, query vs. document).
sim_scores = embeddings[:3] @ embeddings[3:].T
print(sim_scores.diag().tolist())
```
Using with the sentence_transformers library (sentence-transformers>=2.4.0):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sergeyzh/rubert-mini-uncased")

paraphrase = model.encode(["В Ярославской области разрешили работу бань, но без посетителей", "Ярославским баням разрешили работать без посетителей"], prompt="paraphrase: ")
print(paraphrase[0] @ paraphrase[1].T)

categorize_entailment = model.encode(["Женщину доставили в больницу, за ее жизнь сейчас борются врачи.", "Женщину спасают врачи."], prompt="categorize_entailment: ")
print(categorize_entailment[0] @ categorize_entailment[1].T)

query_embedding = model.encode("Сколько программистов нужно, чтобы вкрутить лампочку?", prompt="search_query: ")
document_embedding = model.encode("Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.", prompt="search_document: ")
print(query_embedding @ document_embedding.T)
```
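The query/document prefixes extend naturally to ranking several documents against one query. A small sketch on top of the example above, reusing its sentences; `util.cos_sim` from sentence-transformers computes the cosine similarities:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sergeyzh/rubert-mini-uncased")

query = "Сколько программистов нужно, чтобы вкрутить лампочку?"
docs = [
    "Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.",
    "В Ярославской области разрешили работу бань, но без посетителей",
]

query_embedding = model.encode(query, prompt="search_query: ")
document_embeddings = model.encode(docs, prompt="search_document: ")

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(query_embedding, document_embeddings)[0]
for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print(round(score, 3), doc)
```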
📄 License
This project is licensed under the MIT license.