🚀 rubert-mini-uncased
This model computes sentence embeddings for Russian and English. It is obtained by distilling the embeddings of ai-forever/FRIDA (embedding size 1536, 24 layers). FRIDA's main usage mode, CLS pooling, is replaced with mean pooling; no other changes to the model's behavior are made (such as modifying or filtering the embeddings, or using an additional model). The distillation covers as much of FRIDA as possible: the embeddings of Russian and English sentences as well as the behavior of the prefixes.
The model is uncased: it does not distinguish between uppercase and lowercase letters when processing text. For example, the phrases "С Новым Годом!" and "С НОВЫМ ГОДОМ!" ("Happy New Year!") are encoded with the same token sequence and produce identical embeddings. The model has an embedding size of 384 and 7 layers. Its context size is the same as FRIDA's: 512 tokens.
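A minimal check of the uncased behavior, using the sentence-transformers interface shown in the usage examples below; since both strings map to the same token sequence, their embeddings should coincide:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sergeyzh/rubert-mini-uncased")

# Differently cased versions of the same phrase ("Happy New Year!").
embeddings = model.encode(["С Новым Годом!", "С НОВЫМ ГОДОМ!"])
print(np.allclose(embeddings[0], embeddings[1]))  # expected: True
```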
✨ Features
Prefixes
All prefixes are inherited from FRIDA.
The prefixes used and their influence on the model's scores on the encodechka benchmark (a short usage sketch follows the task list below):
| Prefix | STS | PI | NLI | SA | TI |
|--------|-----|----|-----|----|----|
| - | 0.817 | 0.734 | 0.448 | 0.799 | 0.971 |
| search_query: | 0.828 | 0.752 | 0.463 | 0.794 | 0.973 |
| search_document: | 0.794 | 0.730 | 0.446 | 0.797 | 0.971 |
| paraphrase: | 0.823 | 0.760 | 0.446 | 0.802 | 0.973 |
| categorize: | 0.820 | 0.753 | 0.482 | 0.805 | 0.972 |
| categorize_sentiment: | 0.604 | 0.595 | 0.431 | 0.798 | 0.955 |
| categorize_topic: | 0.711 | 0.485 | 0.391 | 0.750 | 0.962 |
| categorize_entailment: | 0.805 | 0.750 | 0.525 | 0.800 | 0.969 |
Tasks:
- Semantic text similarity (STS);
- Paraphrase identification (PI);
- Natural language inference (NLI);
- Sentiment analysis (SA);
- Toxicity identification (TI).
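A prefix is plain text prepended to the sentence before tokenization; in sentence-transformers the same effect is obtained with the `prompt` argument. The minimal sketch below assumes the model's pooling configuration includes prompt tokens, which is the sentence-transformers default:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sergeyzh/rubert-mini-uncased")

text = "Сколько программистов нужно, чтобы вкрутить лампочку?"

# prompt="search_query: " prepends the prefix to the text before tokenization,
# so it should match manual prepending (assuming prompt tokens are pooled).
with_prompt = model.encode(text, prompt="search_query: ")
prepended = model.encode("search_query: " + text)
print(np.allclose(with_prompt, prepended))  # expected: True
```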
Metrics
The model's scores on the ruMTEB benchmark:
| Task | Metric | Frida | rubert-mini-uncased | rubert-mini-frida | multilingual-e5-large-instruct | multilingual-e5-large |
|------|--------|-------|---------------------|-------------------|--------------------------------|-----------------------|
| CEDRClassification | Accuracy | 0.646 | 0.586 | 0.552 | 0.500 | 0.448 |
| GeoreviewClassification | Accuracy | 0.577 | 0.485 | 0.464 | 0.559 | 0.497 |
| GeoreviewClusteringP2P | V-measure | 0.783 | 0.683 | 0.698 | 0.743 | 0.605 |
| HeadlineClassification | Accuracy | 0.890 | 0.884 | 0.882 | 0.862 | 0.758 |
| InappropriatenessClassification | Accuracy | 0.783 | 0.705 | 0.698 | 0.655 | 0.616 |
| KinopoiskClassification | Accuracy | 0.705 | 0.607 | 0.595 | 0.661 | 0.566 |
| RiaNewsRetrieval | NDCG@10 | 0.868 | 0.791 | 0.721 | 0.824 | 0.807 |
| RuBQReranking | MAP@10 | 0.771 | 0.713 | 0.711 | 0.717 | 0.756 |
| RuBQRetrieval | NDCG@10 | 0.724 | 0.640 | 0.654 | 0.692 | 0.741 |
| RuReviewsClassification | Accuracy | 0.751 | 0.684 | 0.658 | 0.686 | 0.653 |
| RuSTSBenchmarkSTS | Pearson correlation | 0.814 | 0.795 | 0.803 | 0.840 | 0.831 |
| RuSciBenchGRNTIClassification | Accuracy | 0.699 | 0.653 | 0.625 | 0.651 | 0.582 |
| RuSciBenchGRNTIClusteringP2P | V-measure | 0.670 | 0.618 | 0.586 | 0.622 | 0.520 |
| RuSciBenchOECDClassification | Accuracy | 0.546 | 0.509 | 0.491 | 0.502 | 0.445 |
| RuSciBenchOECDClusteringP2P | V-measure | 0.566 | 0.525 | 0.507 | 0.528 | 0.450 |
| SensitiveTopicsClassification | Accuracy | 0.398 | 0.365 | 0.373 | 0.323 | 0.257 |
| TERRaClassification | Average Precision | 0.665 | 0.604 | 0.604 | 0.639 | 0.584 |
Average scores by task type:

| Task type | Metric | Frida | rubert-mini-uncased | rubert-mini-frida | multilingual-e5-large-instruct | multilingual-e5-large |
|-----------|--------|-------|---------------------|-------------------|--------------------------------|-----------------------|
| Classification | Accuracy | 0.707 | 0.657 | 0.631 | 0.654 | 0.588 |
| Clustering | V-measure | 0.673 | 0.608 | 0.597 | 0.631 | 0.525 |
| MultiLabelClassification | Accuracy | 0.522 | 0.476 | 0.463 | 0.412 | 0.353 |
| PairClassification | Average Precision | 0.665 | 0.604 | 0.604 | 0.639 | 0.584 |
| Reranking | MAP@10 | 0.771 | 0.713 | 0.711 | 0.717 | 0.756 |
| Retrieval | NDCG@10 | 0.796 | 0.715 | 0.687 | 0.758 | 0.774 |
| STS | Pearson correlation | 0.814 | 0.795 | 0.803 | 0.840 | 0.831 |
| Average | Average | 0.707 | 0.653 | 0.642 | 0.664 | 0.630 |
💻 Usage Examples
Basic Usage
Using with the transformers library:
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def pool(hidden_state, mask, pooling_method="mean"):
    # Mean pooling over non-padding tokens (the mode this model was distilled for).
    if pooling_method == "mean":
        s = torch.sum(hidden_state * mask.unsqueeze(-1).float(), dim=1)
        d = mask.sum(dim=1, keepdim=True).float()
        return s / d
    elif pooling_method == "cls":
        return hidden_state[:, 0]


# Prefixed inputs: the first three texts are compared with the last three.
inputs = [
    "paraphrase: В Ярославской области разрешили работу бань, но без посетителей",
    "categorize_entailment: Женщину доставили в больницу, за ее жизнь сейчас борются врачи.",
    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",
    "paraphrase: Ярославским баням разрешили работать без посетителей",
    "categorize_entailment: Женщину спасают врачи.",
    "search_document: Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование."
]

tokenizer = AutoTokenizer.from_pretrained("sergeyzh/rubert-mini-uncased")
model = AutoModel.from_pretrained("sergeyzh/rubert-mini-uncased")

tokenized_inputs = tokenizer(inputs, max_length=512, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**tokenized_inputs)

embeddings = pool(
    outputs.last_hidden_state,
    tokenized_inputs["attention_mask"],
    pooling_method="mean"
)

embeddings = F.normalize(embeddings, p=2, dim=1)

# Cosine similarity of each matched pair (paraphrase, entailment, query vs. document).
sim_scores = embeddings[:3] @ embeddings[3:].T
print(sim_scores.diag().tolist())
```
Using with the sentence_transformers library (sentence-transformers>=2.4.0):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sergeyzh/rubert-mini-uncased")

paraphrase = model.encode(["В Ярославской области разрешили работу бань, но без посетителей", "Ярославским баням разрешили работать без посетителей"], prompt="paraphrase: ")
print(paraphrase[0] @ paraphrase[1].T)

categorize_entailment = model.encode(["Женщину доставили в больницу, за ее жизнь сейчас борются врачи.", "Женщину спасают врачи."], prompt="categorize_entailment: ")
print(categorize_entailment[0] @ categorize_entailment[1].T)

query_embedding = model.encode("Сколько программистов нужно, чтобы вкрутить лампочку?", prompt="search_query: ")
document_embedding = model.encode("Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.", prompt="search_document: ")
print(query_embedding @ document_embedding.T)
```
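The query/document prefixes extend naturally to ranking several documents against one query. A small sketch on top of the example above, reusing its sentences; `util.cos_sim` from sentence-transformers computes the cosine similarities:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sergeyzh/rubert-mini-uncased")

query = "Сколько программистов нужно, чтобы вкрутить лампочку?"
docs = [
    "Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.",
    "В Ярославской области разрешили работу бань, но без посетителей",
]

query_embedding = model.encode(query, prompt="search_query: ")
document_embeddings = model.encode(docs, prompt="search_document: ")

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(query_embedding, document_embeddings)[0]
for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print(round(score, 3), doc)
```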
📄 License
This project is licensed under the MIT license.