🚀 rubert-mini-frida - A Lightweight and Fast Modification of FRIDA
rubert-mini-frida is a model for computing sentence embeddings in Russian and English. It was obtained by distilling the embeddings of ai-forever/FRIDA (embedding size 1536, 24 layers) into sergeyzh/rubert-mini-sts (embedding size 312, 7 layers). FRIDA's main usage mode, CLS pooling, is replaced with mean pooling. No other changes to the model's behavior (such as modifying or filtering embeddings, or using an additional model) were made. The distillation is as complete as possible: it covers embeddings of both Russian and English sentences as well as the behavior of the prefixes.
Metadata
| Property | Details |
|---|---|
| Language | Russian, English |
| Pipeline Tag | Sentence Similarity |
| Tags | Russian, Pretraining, Embeddings, Tiny, Feature Extraction, Sentence Similarity, Sentence Transformers, Transformers |
| Datasets | IlyaGusev/gazeta, zloelias/lenta-ru, HuggingFaceFW/fineweb-2, HuggingFaceFW/fineweb |
| License | MIT |
| Base Model | sergeyzh/rubert-mini-sts |
✨ Features
- Multilingual Support: Capable of handling both Russian and English sentences.
- Lightweight Design: Distilled into a 7-layer model with 312-dimensional embeddings, it is smaller and faster than FRIDA (24 layers, 1536 dimensions).
- Multiple Prefixes: Inherited from FRIDA, different prefixes can be used for various tasks.
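To make the "lightweight" claim concrete, the sketch below compares the raw storage footprint of the two embedding sizes quoted in the description (1536 for FRIDA, 312 for rubert-mini-frida). It assumes vectors are stored as float32 (4 bytes per value); that storage format and the helper name `index_size_bytes` are assumptions of this example, not something the model card specifies.

```python
# Embedding sizes quoted in the model description.
FRIDA_DIM = 1536
MINI_DIM = 312

# Assumption for this sketch: vectors are stored as float32 (4 bytes per value).
BYTES_PER_VALUE = 4


def index_size_bytes(num_vectors: int, dim: int) -> int:
    """Raw size of a dense index holding `num_vectors` embeddings of size `dim`."""
    return num_vectors * dim * BYTES_PER_VALUE


num_vectors = 1_000_000
frida_bytes = index_size_bytes(num_vectors, FRIDA_DIM)
mini_bytes = index_size_bytes(num_vectors, MINI_DIM)

print(f"FRIDA:             {frida_bytes / 1e9:.3f} GB")
print(f"rubert-mini-frida: {mini_bytes / 1e9:.3f} GB")
print(f"ratio:             {frida_bytes / mini_bytes:.2f}x")
```

Under these assumptions, one million embeddings take roughly a fifth of the space. This covers only the stored vectors, not model weights or inference time.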
💻 Usage Examples
Basic Usage
Using with the transformers library
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def pool(hidden_state, mask, pooling_method="mean"):
    """Pool token embeddings into one vector per input sequence."""
    if pooling_method == "mean":
        # Sum the vectors of real (non-padding) tokens, then divide by their count.
        s = torch.sum(hidden_state * mask.unsqueeze(-1).float(), dim=1)
        d = mask.sum(dim=1, keepdim=True).float()
        return s / d
    elif pooling_method == "cls":
        # Use the vector of the first ([CLS]) token.
        return hidden_state[:, 0]
    raise ValueError(f"Unknown pooling method: {pooling_method}")


# Each input starts with a task prefix; inputs[i] is paired with inputs[i + 3].
inputs = [
    "paraphrase: В Ярославской области разрешили работу бань, но без посетителей",
    "categorize_entailment: Женщину доставили в больницу, за ее жизнь сейчас борются врачи.",
    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",
    "paraphrase: Ярославским баням разрешили работать без посетителей",
    "categorize_entailment: Женщину спасают врачи.",
    "search_document: Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование."
]

tokenizer = AutoTokenizer.from_pretrained("sergeyzh/rubert-mini-frida")
model = AutoModel.from_pretrained("sergeyzh/rubert-mini-frida")

tokenized_inputs = tokenizer(inputs, max_length=512, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**tokenized_inputs)

embeddings = pool(
    outputs.last_hidden_state,
    tokenized_inputs["attention_mask"],
    pooling_method="mean"
)

# L2-normalize so that the dot product equals cosine similarity.
embeddings = F.normalize(embeddings, p=2, dim=1)

# Similarity of each of the first three inputs to its counterpart in the last three.
sim_scores = embeddings[:3] @ embeddings[3:].T
print(sim_scores.diag().tolist())
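The mean branch of `pool` averages token vectors over real tokens only, so padding does not dilute the result. The toy sketch below reproduces that computation with plain Python lists, so it runs without torch or a model download. The function name `masked_mean` and the numbers are illustrative and not part of the model's code.

```python
def masked_mean(hidden_state, mask):
    """Average token vectors per sequence, counting only positions where mask == 1.

    hidden_state: list of sequences, each a list of token vectors (lists of floats).
    mask: list of sequences of 0/1 flags with the same layout as the tokens.
    """
    pooled = []
    for tokens, flags in zip(hidden_state, mask):
        dim = len(tokens[0])
        total = [0.0] * dim
        for vector, flag in zip(tokens, flags):
            for i in range(dim):
                # Padding positions have flag == 0 and contribute nothing.
                total[i] += vector[i] * flag
        count = sum(flags)
        pooled.append([value / count for value in total])
    return pooled


# One sequence of three positions; the last one is padding and must be ignored.
hidden_state = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
mask = [[1, 1, 0]]

print(masked_mean(hidden_state, mask))  # [[2.0, 3.0]]
```

Without the mask, the padding vector would pull the average to roughly `[34.7, 35.3]`, which is why the attention mask is passed to the pooling step.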
Using with the sentence_transformers library (sentence-transformers>=2.4.0)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sergeyzh/rubert-mini-frida")

# Each task passes its own prefix through the prompt argument.
paraphrase = model.encode(["В Ярославской области разрешили работу бань, но без посетителей", "Ярославским баням разрешили работать без посетителей"], prompt="paraphrase: ")
print(paraphrase[0] @ paraphrase[1].T)

categorize_entailment = model.encode(["Женщину доставили в больницу, за ее жизнь сейчас борются врачи.", "Женщину спасают врачи."], prompt="categorize_entailment: ")
print(categorize_entailment[0] @ categorize_entailment[1].T)

# Queries and documents use different prefixes for search.
query_embedding = model.encode("Сколько программистов нужно, чтобы вкрутить лампочку?", prompt="search_query: ")
document_embedding = model.encode("Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.", prompt="search_document: ")
print(query_embedding @ document_embedding.T)
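For search, the query/document dot product shown above extends naturally to ranking several documents. The following is a minimal sketch of that step using toy vectors in place of real embeddings. The helper names `l2_normalize` and `rank_documents` are hypothetical, and the explicit normalization is a precaution of this example, not a statement about what `model.encode` returns.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def rank_documents(query_vector, document_vectors):
    """Return (document_index, score) pairs sorted from most to least similar."""
    query = l2_normalize(query_vector)
    scores = []
    for index, document in enumerate(document_vectors):
        doc = l2_normalize(document)
        scores.append((index, sum(q * d for q, d in zip(query, doc))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for embeddings. Real vectors would come from
# model.encode(..., prompt="search_query: ") and prompt="search_document: ".
query_vector = [1.0, 0.0]
document_vectors = [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]

ranking = rank_documents(query_vector, document_vectors)
print([index for index, _ in ranking])  # [1, 2, 0]
```

For large collections a vector index would normally replace the Python loop; the loop is kept here only to make the scoring explicit.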
📚 Documentation
Prefixes
All prefixes are inherited from FRIDA. The prefix "categorize: ", which gives the best average results across most tasks, is set as the default in config_sentence_transformers.json.
The supported prefixes and their effect on the model's encodechka scores are as follows:
| Prefix | STS | PI | NLI | SA | TI |
|---|---|---|---|---|---|
| - | 0.839 | 0.762 | 0.475 | 0.801 | 0.972 |
| search_query: | 0.846 | 0.761 | 0.498 | 0.800 | 0.973 |
| search_document: | 0.830 | 0.748 | 0.468 | 0.794 | 0.972 |
| paraphrase: | 0.835 | 0.764 | 0.475 | 0.799 | 0.973 |
| categorize: | 0.850 | 0.761 | 0.516 | 0.802 | 0.973 |
| categorize_sentiment: | 0.755 | 0.656 | 0.427 | 0.798 | 0.959 |
| categorize_topic: | 0.734 | 0.523 | 0.389 | 0.728 | 0.959 |
| categorize_entailment: | 0.837 | 0.753 | 0.544 | 0.802 | 0.970 |
Tasks:
- Semantic text similarity (STS);
- Paraphrase identification (PI);
- Natural language inference (NLI);
- Sentiment analysis (SA);
- Toxicity identification (TI).
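As a worked reading of the prefix table, the sketch below looks up the highest-scoring prefix for a given encodechka task. The scores are copied from the table; treating the "-" row as "no prefix" (an empty string) is an assumption of this example, and `best_prefix` is a hypothetical helper, not part of the model's API.

```python
# Scores copied from the prefix table (encodechka tasks), in column order.
TASKS = ("STS", "PI", "NLI", "SA", "TI")

# Assumption: the "-" row of the table means no prefix, represented here as "".
PREFIX_SCORES = {
    "": (0.839, 0.762, 0.475, 0.801, 0.972),
    "search_query: ": (0.846, 0.761, 0.498, 0.800, 0.973),
    "search_document: ": (0.830, 0.748, 0.468, 0.794, 0.972),
    "paraphrase: ": (0.835, 0.764, 0.475, 0.799, 0.973),
    "categorize: ": (0.850, 0.761, 0.516, 0.802, 0.973),
    "categorize_sentiment: ": (0.755, 0.656, 0.427, 0.798, 0.959),
    "categorize_topic: ": (0.734, 0.523, 0.389, 0.728, 0.959),
    "categorize_entailment: ": (0.837, 0.753, 0.544, 0.802, 0.970),
}


def best_prefix(task: str) -> str:
    """Return the prefix with the highest score for `task`.

    On ties, the prefix listed first in the table wins.
    """
    column = TASKS.index(task)
    return max(PREFIX_SCORES, key=lambda prefix: PREFIX_SCORES[prefix][column])


for task in TASKS:
    print(f"{task}: {best_prefix(task)!r}")
```

The differences between the top prefixes are often in the third decimal place (SA and TI have exact ties), so these lookups are a rough guide and should be re-checked on your own data.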
Metrics
The model's evaluations on the ruMTEB benchmark are as follows:
| Task | Metric | FRIDA | rubert-mini-frida | multilingual-e5-large-instruct | multilingual-e5-large |
|---|---|---|---|---|---|
| CEDRClassification | Accuracy | 0.646 | 0.552 | 0.500 | 0.448 |
| GeoreviewClassification | Accuracy | 0.577 | 0.464 | 0.559 | 0.497 |
| GeoreviewClusteringP2P | V-measure | 0.783 | 0.698 | 0.743 | 0.605 |
| HeadlineClassification | Accuracy | 0.890 | 0.880 | 0.862 | 0.758 |
| InappropriatenessClassification | Accuracy | 0.783 | 0.698 | 0.655 | 0.616 |
| KinopoiskClassification | Accuracy | 0.705 | 0.595 | 0.661 | 0.566 |
| RiaNewsRetrieval | NDCG@10 | 0.868 | 0.721 | 0.824 | 0.807 |
| RuBQReranking | MAP@10 | 0.771 | 0.711 | 0.717 | 0.756 |
| RuBQRetrieval | NDCG@10 | 0.724 | 0.654 | 0.692 | 0.741 |
| RuReviewsClassification | Accuracy | 0.751 | 0.658 | 0.686 | 0.653 |
| RuSTSBenchmarkSTS | Pearson correlation | 0.814 | 0.803 | 0.840 | 0.831 |
| RuSciBenchGRNTIClassification | Accuracy | 0.699 | 0.625 | 0.651 | 0.582 |
| RuSciBenchGRNTIClusteringP2P | V-measure | 0.670 | 0.586 | 0.622 | 0.520 |
| RuSciBenchOECDClassification | Accuracy | 0.546 | 0.493 | 0.502 | 0.445 |
| RuSciBenchOECDClusteringP2P | V-measure | 0.566 | 0.507 | 0.528 | 0.450 |
| SensitiveTopicsClassification | Accuracy | 0.398 | 0.373 | 0.323 | 0.257 |
| TERRaClassification | Average Precision | 0.665 | 0.606 | 0.639 | 0.584 |
Results averaged by task type:
| Task Type | Metric | FRIDA | rubert-mini-frida | multilingual-e5-large-instruct | multilingual-e5-large |
|---|---|---|---|---|---|
| Classification | Accuracy | 0.707 | 0.631 | 0.654 | 0.588 |
| Clustering | V-measure | 0.673 | 0.597 | 0.631 | 0.525 |
| MultiLabelClassification | Accuracy | 0.522 | 0.463 | 0.412 | 0.353 |
| PairClassification | Average Precision | 0.665 | 0.606 | 0.639 | 0.584 |
| Reranking | MAP@10 | 0.771 | 0.711 | 0.717 | 0.756 |
| Retrieval | NDCG@10 | 0.796 | 0.687 | 0.758 | 0.774 |
| STS | Pearson correlation | 0.814 | 0.803 | 0.840 | 0.831 |
| Average | Average | 0.707 | 0.643 | 0.664 | 0.630 |
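The Average row can be reproduced as the unweighted mean of the seven task-type scores. The sketch below checks that and also computes how much of FRIDA's average score the distilled model keeps. The numbers are copied from the summary table above; the "retained share" is a derived figure for illustration and does not appear in the original results.

```python
# Task-type scores copied from the summary table, in row order:
# Classification, Clustering, MultiLabelClassification, PairClassification,
# Reranking, Retrieval, STS.
SCORES = {
    "FRIDA": (0.707, 0.673, 0.522, 0.665, 0.771, 0.796, 0.814),
    "rubert-mini-frida": (0.631, 0.597, 0.463, 0.606, 0.711, 0.687, 0.803),
    "multilingual-e5-large-instruct": (0.654, 0.631, 0.412, 0.639, 0.717, 0.758, 0.840),
    "multilingual-e5-large": (0.588, 0.525, 0.353, 0.584, 0.756, 0.774, 0.831),
}

# Unweighted mean over the seven task types for each model.
averages = {name: sum(values) / len(values) for name, values in SCORES.items()}
for name, average in averages.items():
    print(f"{name}: {average:.3f}")

# Share of the teacher's average score that the distilled model keeps.
retained = averages["rubert-mini-frida"] / averages["FRIDA"]
print(f"share of FRIDA's average score retained: {retained:.1%}")
```

The printed averages match the Average row (0.707, 0.643, 0.664, 0.630), and the distilled model keeps about 90.9% of FRIDA's average score.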
📄 License
This project is licensed under the MIT License.