rubert-mini-frida Open-Source Model - Rapidly Calculate Sentence Embedding Vectors for Russian and English

Rubert Mini Frida

Developed by sergeyzh

A lightweight and fast modified version of the FRIDA model for computing embedding vectors of Russian and English sentences

Text Embedding

Transformers

Supports Multiple Languages

Open Source License: MIT

Tags: Russian-English bilingual embedding, Lightweight distillation, Sentence similarity

Downloads 1,203

Release Time: 3/2/2025

Model Overview

This model is obtained by distilling the embeddings of FRIDA (embedding dimension 1536, 24 layers) into rubert-mini-sts (embedding dimension 312, 7 layers). It is primarily used to compute embeddings of Russian and English sentences and to compare their similarity.

Model Features

Lightweight and efficient

Significantly reduces model size (from 24 layers to 7 layers) through distillation while maintaining good performance

Multilingual support

Supports sentence embedding computation for both Russian and English

Prefix functionality

Inherits FRIDA's prefix functionality, allowing optimization for specific tasks with different prefixes

Mean pooling

Uses mean pooling over token states in place of CLS pooling, which is FRIDA's main usage mode
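The difference between the two pooling modes can be illustrated with a dependency-free toy sketch. The function names `mean_pool` and `cls_pool` and the 2-dimensional vectors are illustrative assumptions, not part of the model's code; the real model pools 312-dimensional hidden states with PyTorch.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens; positions with mask 0 (padding) are skipped."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]


def cls_pool(token_vectors):
    """Return the vector of the first token (the [CLS] position)."""
    return list(token_vectors[0])


# Toy sentence: three real tokens followed by one padding position.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

mean_vec = mean_pool(tokens, mask)  # averages the first three vectors only
cls_vec = cls_pool(tokens)          # takes the first vector as-is
print(mean_vec, cls_vec)
```

Mean pooling depends on every unmasked token, while CLS pooling reads a single position, so the padding vector affects neither result here.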

Model Capabilities

Compute sentence embedding vectors

Russian sentence similarity comparison

English sentence similarity comparison

Text classification support

Information retrieval support

Use Cases

Text similarity

Paraphrase identification

Identify whether two sentences express the same meaning in different ways

The example paraphrase pair in the usage section scores a cosine similarity of about 0.94

Semantic search

Build a semantic search engine to match queries with documents

Achieved NDCG@10 of 0.721 in news retrieval tasks

Classification tasks

Sentiment analysis

Classify sentiment tendencies in Russian texts

Achieved an accuracy of 0.658 in Russian review classification tasks

Topic classification

Classify topics of Russian news articles

Achieved an accuracy of 0.880 in news headline classification tasks
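For the semantic search use case listed above, ranking comes down to comparing a query embedding with document embeddings by cosine similarity. The following is a minimal sketch under stated assumptions: `normalize` and `rank_documents` are hypothetical helpers, and the 3-dimensional vectors stand in for real embeddings, which would be produced by the model with the "search_query: " and "search_document: " prefixes.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (document index, cosine similarity) pairs, best match first."""
    q = normalize(query_vec)
    scored = []
    for idx, doc in enumerate(doc_vecs):
        d = normalize(doc)
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for model embeddings (assumption for illustration).
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [2.0, 0.1, 0.0], [1.0, 1.0, 0.0]]

ranking = rank_documents(query, docs, top_k=2)
for idx, score in ranking:
    print(idx, round(score, 3))
```

With embeddings that are already L2-normalized, the normalization step is redundant and the dot product alone gives the cosine similarity.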

🚀 rubert-mini-frida - A Lightweight and Fast Modification of FRIDA

rubert-mini-frida is a model for computing sentence embeddings in Russian and English. It was obtained by distilling the embeddings of ai-forever/FRIDA (embedding size 1536, 24 layers) into sergeyzh/rubert-mini-sts (embedding size 312, 7 layers). FRIDA's main usage mode, CLS pooling, is replaced with mean pooling. No other changes to the model's behavior are made: embeddings are not modified or filtered, and no additional model is used. The distillation is as complete as possible: it covers embeddings of both Russian and English sentences as well as the behavior of the prefixes.

Metadata

Property	Details
Language	Russian, English
Pipeline Tag	Sentence Similarity
Tags	Russian, Pretraining, Embeddings, Tiny, Feature Extraction, Sentence Similarity, Sentence Transformers, Transformers
Datasets	IlyaGusev/gazeta, zloelias/lenta-ru, HuggingFaceFW/fineweb-2, HuggingFaceFW/fineweb
License	MIT
Base Model	sergeyzh/rubert-mini-sts

✨ Features

Multilingual Support: Capable of handling both Russian and English sentences.
Lightweight Design: Distilled into a 7-layer model with 312-dimensional embeddings (FRIDA has 24 layers and 1536 dimensions).
Multiple Prefixes: Inherited from FRIDA, different prefixes can be used for various tasks.

📦 Installation

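The original model card lists no installation steps. The examples below rely on the torch, transformers, and sentence-transformers (version 2.4.0 or later) Python packages.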

💻 Usage Examples

Basic Usage

Using with the `transformers` Library

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


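def pool(hidden_state, mask, pooling_method="mean"):
    """Pool token states into one vector per sentence."""
    if pooling_method == "mean":
        # Sum the states of real tokens only; the mask zeroes out padding.
        s = torch.sum(hidden_state * mask.unsqueeze(-1).float(), dim=1)
        d = mask.sum(dim=1, keepdim=True).float()
        return s / d
    elif pooling_method == "cls":
        # Use the state of the first ([CLS]) token.
        return hidden_state[:, 0]
    raise ValueError(f"Unknown pooling_method: {pooling_method!r}")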

inputs = [
    # first sentence of each pair
    "paraphrase: В Ярославской области разрешили работу бань, но без посетителей",
    "categorize_entailment: Женщину доставили в больницу, за ее жизнь сейчас борются врачи.",
    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",
    # second sentence of each pair, in the same order
    "paraphrase: Ярославским баням разрешили работать без посетителей",
    "categorize_entailment: Женщину спасают врачи.",
    "search_document: Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование."
]

tokenizer = AutoTokenizer.from_pretrained("sergeyzh/rubert-mini-frida")
model = AutoModel.from_pretrained("sergeyzh/rubert-mini-frida")

tokenized_inputs = tokenizer(inputs, max_length=512, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**tokenized_inputs)
    
embeddings = pool(
    outputs.last_hidden_state, 
    tokenized_inputs["attention_mask"],
    pooling_method="mean"
)

embeddings = F.normalize(embeddings, p=2, dim=1)
sim_scores = embeddings[:3] @ embeddings[3:].T
print(sim_scores.diag().tolist())
# [0.9423348903656006, 0.8306248188018799, 0.7095720767974854]
# [0.9360030293464661, 0.8591322302818298, 0.728583037853241] - FRIDA

Using with the `sentence_transformers` Library (`sentence-transformers>=2.4.0`)

from sentence_transformers import SentenceTransformer

# loads model with mean pooling
model = SentenceTransformer("sergeyzh/rubert-mini-frida")

paraphrase = model.encode(["В Ярославской области разрешили работу бань, но без посетителей", "Ярославским баням разрешили работать без посетителей"], prompt="paraphrase: ")
print(paraphrase[0] @ paraphrase[1].T) 
# 0.94233495
# 0.9360032 - FRIDA

categorize_entailment = model.encode(["Женщину доставили в больницу, за ее жизнь сейчас борются врачи.", "Женщину спасают врачи."], prompt="categorize_entailment: ")
print(categorize_entailment[0] @ categorize_entailment[1].T) 
# 0.8306249
# 0.8591322 - FRIDA

query_embedding = model.encode("Сколько программистов нужно, чтобы вкрутить лампочку?", prompt="search_query: ")
document_embedding = model.encode("Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.", prompt="search_document: ")
print(query_embedding @ document_embedding.T) 
# 0.70957196
# 0.7285831 - FRIDA

📚 Documentation

Prefixes

All prefixes are inherited from FRIDA. The prefix "categorize: ", which gives solid all-round results on most tasks, is set as the default in config_sentence_transformers.json.
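When texts are tokenized directly rather than encoded through sentence_transformers, the prefix has to be prepended by hand. A minimal sketch, assuming a hypothetical helper `with_prefix` and a `PREFIXES` mapping built from the prefix names in this document:

```python
# Prefix strings as used in this document; each ends with a colon and a space.
PREFIXES = {
    "search_query": "search_query: ",
    "search_document": "search_document: ",
    "paraphrase": "paraphrase: ",
    "categorize": "categorize: ",
    "categorize_sentiment": "categorize_sentiment: ",
    "categorize_topic": "categorize_topic: ",
    "categorize_entailment": "categorize_entailment: ",
}


def with_prefix(texts, task="categorize"):
    """Prepend the prefix for `task` to every text; reject unknown task names."""
    if task not in PREFIXES:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(PREFIXES)}")
    prefix = PREFIXES[task]
    return [prefix + text for text in texts]


queries = with_prefix(["Сколько программистов нужно, чтобы вкрутить лампочку?"], "search_query")
print(queries[0])
```

The resulting strings can be passed to the tokenizer as in the `transformers` example, while `model.encode(..., prompt=...)` in sentence_transformers takes the prefix as a separate argument.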

The prefixes and their effect on the model's scores on the encodechka benchmark are listed below (the "-" row is the model used without a prefix):

Prefix	STS	PI	NLI	SA	TI
-	0.839	0.762	0.475	0.801	0.972
search_query:	0.846	0.761	0.498	0.800	0.973
search_document:	0.830	0.748	0.468	0.794	0.972
paraphrase:	0.835	0.764	0.475	0.799	0.973
categorize:	0.850	0.761	0.516	0.802	0.973
categorize_sentiment:	0.755	0.656	0.427	0.798	0.959
categorize_topic:	0.734	0.523	0.389	0.728	0.959
categorize_entailment:	0.837	0.753	0.544	0.802	0.970

Tasks:

Semantic text similarity (STS);
Paraphrase identification (PI);
Natural language inference (NLI);
Sentiment analysis (SA);
Toxicity identification (TI).
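One simple way to compare prefixes overall is to average the five task scores in each row of the prefix table above. The sketch below does that arithmetic; the unweighted mean is an assumption chosen for illustration, and other weightings could order the prefixes differently.

```python
# Scores copied from the prefix table above, in column order: STS, PI, NLI, SA, TI.
SCORES = {
    "-": (0.839, 0.762, 0.475, 0.801, 0.972),
    "search_query:": (0.846, 0.761, 0.498, 0.800, 0.973),
    "search_document:": (0.830, 0.748, 0.468, 0.794, 0.972),
    "paraphrase:": (0.835, 0.764, 0.475, 0.799, 0.973),
    "categorize:": (0.850, 0.761, 0.516, 0.802, 0.973),
    "categorize_sentiment:": (0.755, 0.656, 0.427, 0.798, 0.959),
    "categorize_topic:": (0.734, 0.523, 0.389, 0.728, 0.959),
    "categorize_entailment:": (0.837, 0.753, 0.544, 0.802, 0.970),
}


def mean_scores(scores):
    """Unweighted mean of the task scores for each prefix."""
    return {prefix: sum(values) / len(values) for prefix, values in scores.items()}


averages = mean_scores(SCORES)
ranked = sorted(averages, key=averages.get, reverse=True)
for prefix in ranked:
    print(f"{prefix:<24}{averages[prefix]:.3f}")
```

On these five columns "categorize_entailment:" and "categorize:" come out nearly tied at roughly 0.78, and "categorize_topic:" scores lowest.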

Metrics

The model's evaluations on the ruMTEB benchmark are as follows:

Task	Metric	FRIDA	rubert-mini-frida	multilingual-e5-large-instruct	multilingual-e5-large
CEDRClassification	Accuracy	0.646	0.552	0.500	0.448
GeoreviewClassification	Accuracy	0.577	0.464	0.559	0.497
GeoreviewClusteringP2P	V-measure	0.783	0.698	0.743	0.605
HeadlineClassification	Accuracy	0.890	0.880	0.862	0.758
InappropriatenessClassification	Accuracy	0.783	0.698	0.655	0.616
KinopoiskClassification	Accuracy	0.705	0.595	0.661	0.566
RiaNewsRetrieval	NDCG@10	0.868	0.721	0.824	0.807
RuBQReranking	MAP@10	0.771	0.711	0.717	0.756
RuBQRetrieval	NDCG@10	0.724	0.654	0.692	0.741
RuReviewsClassification	Accuracy	0.751	0.658	0.686	0.653
RuSTSBenchmarkSTS	Pearson correlation	0.814	0.803	0.840	0.831
RuSciBenchGRNTIClassification	Accuracy	0.699	0.625	0.651	0.582
RuSciBenchGRNTIClusteringP2P	V-measure	0.670	0.586	0.622	0.520
RuSciBenchOECDClassification	Accuracy	0.546	0.493	0.502	0.445
RuSciBenchOECDClusteringP2P	V-measure	0.566	0.507	0.528	0.450
SensitiveTopicsClassification	Accuracy	0.398	0.373	0.323	0.257
TERRaClassification	Average Precision	0.665	0.606	0.639	0.584

Task Type	Metric	FRIDA	rubert-mini-frida	multilingual-e5-large-instruct	multilingual-e5-large
Classification	Accuracy	0.707	0.631	0.654	0.588
Clustering	V-measure	0.673	0.597	0.631	0.525
MultiLabelClassification	Accuracy	0.522	0.463	0.412	0.353
PairClassification	Average Precision	0.665	0.606	0.639	0.584
Reranking	MAP@10	0.771	0.711	0.717	0.756
Retrieval	NDCG@10	0.796	0.687	0.758	0.774
STS	Pearson correlation	0.814	0.803	0.840	0.831
Average	Average	0.707	0.643	0.664	0.630
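The "Average" row can be reproduced as the unweighted mean of the seven task-type scores in the summary table above. The sketch below checks that arithmetic and also computes how much of FRIDA's average the distilled model retains; the variable and function names are illustrative.

```python
# Task-type scores copied from the summary table above, in row order:
# Classification, Clustering, MultiLabelClassification, PairClassification,
# Reranking, Retrieval, STS.
CATEGORY_SCORES = {
    "FRIDA": [0.707, 0.673, 0.522, 0.665, 0.771, 0.796, 0.814],
    "rubert-mini-frida": [0.631, 0.597, 0.463, 0.606, 0.711, 0.687, 0.803],
    "multilingual-e5-large-instruct": [0.654, 0.631, 0.412, 0.639, 0.717, 0.758, 0.840],
    "multilingual-e5-large": [0.588, 0.525, 0.353, 0.584, 0.756, 0.774, 0.831],
}


def macro_average(scores):
    """Unweighted mean over task types."""
    return sum(scores) / len(scores)


averages = {name: round(macro_average(s), 3) for name, s in CATEGORY_SCORES.items()}
retained = macro_average(CATEGORY_SCORES["rubert-mini-frida"]) / macro_average(CATEGORY_SCORES["FRIDA"])

for name, value in averages.items():
    print(f"{name:<34}{value:.3f}")
print(f"share of FRIDA's average retained: {retained:.1%}")
```

The rounded means match the "Average" row, and by this measure the distilled model keeps about 91% of FRIDA's average score.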

📄 License

This project is licensed under the MIT License.
