🚀 USER2-base
USER2 is a new generation of the Universal Sentence Encoder for Russian, designed to represent sentences with long-context support of up to 8,192 tokens. The model is built on top of the RuModernBERT encoders and fine-tuned for retrieval and semantic tasks. It also supports Matryoshka Representation Learning (MRL), a technique that enables reducing embedding size with minimal loss in representation quality. This is the base model with 149 million parameters.
🚀 Quick Start
USER2-base encodes Russian sentences into dense embeddings, handles inputs of up to 8,192 tokens, and supports MRL for producing smaller embeddings when needed. To get started, install the libraries listed under "📦 Installation" and run the snippets in the "💻 Usage Examples" section.
✨ Features
- Long-context Support: Handles up to 8,192 tokens, suitable for long texts.
- MRL Support: Reduces embedding size with minimal quality loss.
- Fine-tuned for Retrieval and Semantic Tasks: tuned for retrieval, reranking, classification, clustering, and STS (see the benchmark tables below).
📦 Installation
Install `sentence-transformers` for the high-level API (`pip install sentence-transformers`), or `transformers` and `torch` (`pip install transformers torch`) to use the model directly, as shown in the examples below.
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-base")
# Encode with task-specific prefixes: "search_query: " for queries, "search_document: " for documents.
query_embeddings = model.encode(["Когда был спущен на воду первый миноносец «Спокойный»?"], prompt_name="search_query")
document_embeddings = model.encode(["Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."], prompt_name="search_document")
# Cosine similarity matrix between queries and documents.
similarities = model.similarity(query_embeddings, document_embeddings)
```
Advanced Usage
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions.
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


# With plain transformers, the task prefixes are prepended manually.
queries = ["search_query: Когда был спущен на воду первый миноносец «Спокойный»?"]
documents = ["search_document: Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."]

tokenizer = AutoTokenizer.from_pretrained("deepvk/USER2-base")
model = AutoModel.from_pretrained("deepvk/USER2-base")

encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    queries_outputs = model(**encoded_queries)
    documents_outputs = model(**encoded_documents)

# Mean-pool and L2-normalize so that the dot product equals cosine similarity.
query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)

similarities = query_embeddings @ doc_embeddings.T
```
📚 Documentation
Performance
To evaluate the model, we measure quality on the MTEB-rus benchmark. Additionally, to measure long-context retrieval, we run the Russian subset of the MultiLongDocRetrieval (MLDR) task.
MTEB-rus
| Model | Size | Hidden Dim | Context Length | MRL support | Mean(task) | Mean(taskType) | Classification | Clustering | MultiLabelClassification | PairClassification | Reranking | Retrieval | STS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| USER-base | 124M | 768 | 512 | ❌ | 58.11 | 56.67 | 59.89 | 53.26 | 37.72 | 59.76 | 55.58 | 56.14 | 74.35 |
| USER-bge-m3 | 359M | 1024 | 8192 | ❌ | 62.80 | 62.28 | 61.92 | 53.66 | 36.18 | 65.07 | 68.72 | 73.63 | 76.76 |
| multilingual-e5-base | 278M | 768 | 512 | ❌ | 58.34 | 57.24 | 58.25 | 50.27 | 33.65 | 54.98 | 66.24 | 67.14 | 70.16 |
| multilingual-e5-large-instruct | 560M | 1024 | 512 | ❌ | 65.00 | 63.36 | 66.28 | 63.13 | 41.15 | 63.89 | 64.35 | 68.23 | 76.48 |
| jina-embeddings-v3 | 572M | 1024 | 8192 | ✅ | 63.45 | 60.93 | 65.24 | 60.90 | 39.24 | 59.22 | 53.86 | 71.99 | 76.04 |
| ru-en-RoSBERTa | 404M | 1024 | 512 | ❌ | 61.71 | 60.40 | 62.56 | 56.06 | 38.88 | 60.79 | 63.89 | 66.52 | 74.13 |
| USER2-small | 34M | 384 | 8192 | ✅ | 58.32 | 56.68 | 59.76 | 57.06 | 33.56 | 54.02 | 58.26 | 61.87 | 72.25 |
| USER2-base | 149M | 768 | 8192 | ✅ | 61.12 | 59.59 | 61.67 | 59.22 | 36.61 | 56.39 | 62.06 | 66.90 | 74.28 |
MLDR-rus
| Model | Size | nDCG@10 ↑ |
|---|---|---|
| USER-bge-m3 | 359M | 58.53 |
| KaLM-v1.5 | 494M | 53.75 |
| jina-embeddings-v3 | 572M | 49.67 |
| E5-mistral-7b | 7.11B | 52.40 |
| USER2-small | 34M | 51.69 |
| USER2-base | 149M | 54.17 |
We compare only models with a context length of 8192.
Matryoshka
To evaluate MRL capabilities, we also use MTEB-rus, applying dimensionality cropping to the embeddings to match the selected size.
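In practice, MRL means an embedding can simply be cut to its first N dimensions (and re-normalized if unit-norm vectors are needed). Below is a minimal sketch using the `truncate_dim` argument available in recent `sentence-transformers` releases; the 256-dimension target is an arbitrary choice for illustration, not a recommended setting.

```python
from sentence_transformers import SentenceTransformer

# truncate_dim keeps only the first 256 dimensions of each embedding (MRL-style cropping).
model = SentenceTransformer("deepvk/USER2-base", truncate_dim=256)

embeddings = model.encode(
    ["Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."],
    prompt_name="search_document",
)
print(embeddings.shape)  # (1, 256)
```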

Prefixes
This model is trained similarly to Nomic Embed and expects task-specific prefixes to be added to the input. The choice of prefix depends on the specific task. We follow a few general guidelines when selecting a prefix:
- "classification: " is the default and most universal prefix, often performing well across a variety of tasks.
- "clustering: " is recommended for clustering applications: group texts into clusters, discover shared topics, or remove semantic duplicates.
- "search_query: " and "search_document: " are intended for retrieval and reranking tasks. Also, in some classification tasks, especially with shorter texts, "search_query" shows superior performance to other prefixes. On the other hand, "search_document" can be beneficial for long-context sentence similarity tasks.
However, we encourage users to experiment with different prefixes, as certain domains may benefit from specific ones.
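As a concrete illustration, the snippet below assumes the prefixes above are also registered as prompts in the model's `sentence-transformers` configuration (as `search_query`/`search_document` are in the Basic Usage example); if a prompt name is not registered, prepend the prefix string yourself, as in the second option.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-base")
# Example texts: "Great service, recommend to everyone!" / "Terrible service, I won't come back."
texts = ["Отличный сервис, всем рекомендую!", "Ужасное обслуживание, больше не приду."]

# Option 1: use a registered prompt name (assumed to be defined in the model config).
embeddings = model.encode(texts, prompt_name="classification")

# Option 2: prepend the prefix manually; equivalent and always available.
embeddings = model.encode([f"classification: {t}" for t in texts])
```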
Training details
This is the base version with 149 million parameters, based on RuModernBERT-base. It was fine-tuned in three stages: RetroMAE, Weakly Supervised Fine-Tuning, and Supervised Fine-Tuning.
Following the bge-m3 training strategy, we use RetroMAE as a retrieval-oriented continuous pretraining step. Leveraging data from the final stage of RuModernBERT training, RetroMAE enhances retrieval quality—particularly for long-context inputs.
To follow best practices for building a state-of-the-art encoder, we rely on large-scale training with weakly related text pairs. However, such datasets are not publicly available for Russian, unlike for English or Chinese. To overcome this, we apply two complementary strategies:
- Cross-lingual transfer: We train on both English and Russian data, leveraging English resources (nomic-unsupervised) alongside our in-house English-Russian parallel corpora.
- Unsupervised pair mining: From the deepvk/cultura_ru_edu corpus, we extract 50M pairs using a simple heuristic: selecting non-overlapping text blocks that are not substrings of one another (sketched below).
This approach has shown promising results, allowing us to train high-performing models with minimal target-language pairs—especially when compared to pipelines used for other languages.
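To make the heuristic concrete, the sketch below shows one possible implementation (an illustration of the description above, not the actual mining code; the 256-word block size is an assumed parameter): it draws two non-overlapping word blocks from a document and keeps the pair only if neither block is a substring of the other.

```python
import random


def mine_pair(text: str, block_words: int = 256):
    """Return a weakly related (anchor, positive) pair from one document, or None."""
    words = text.split()
    if len(words) < 2 * block_words:
        return None  # document too short for two non-overlapping blocks

    # Two non-overlapping, contiguous word blocks from the same document.
    first = " ".join(words[:block_words])
    second_start = random.randrange(block_words, len(words) - block_words + 1)
    second = " ".join(words[second_start:second_start + block_words])

    # Discard degenerate pairs where one block is contained in the other.
    if first in second or second in first:
        return None
    return first, second
```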
The table below shows the datasets used and the number of times each was upsampled.
For the third stage, we switch to cleaner, task-specific datasets. In some cases, additional filtering was applied using a cross-encoder. For all retrieval datasets, we mine hard negatives.
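Hard-negative mining can be illustrated as follows: for each query, candidate passages are scored with the encoder and the highest-scoring non-positive passages are kept as negatives. The sketch below is illustrative only; the `top_k` value and the use of the final USER2-base checkpoint are assumptions, not the actual training setup.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-base")


def mine_hard_negatives(query: str, positive: str, corpus: list[str], top_k: int = 5) -> list[str]:
    """Return the top_k passages most similar to the query, excluding the known positive."""
    query_emb = model.encode([query], prompt_name="search_query")
    doc_embs = model.encode(corpus, prompt_name="search_document")
    scores = model.similarity(query_emb, doc_embs)[0]   # cosine similarities, shape (len(corpus),)
    ranked = scores.argsort(descending=True).tolist()   # candidate indices, best first
    return [corpus[i] for i in ranked if corpus[i] != positive][:top_k]
```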
Ablation
Alongside the final model, we also release all intermediate training steps. Both the retromae and weakly_sft models are available under the specified revisions in this repository. We hope these additional models prove useful for your experiments.
Below is a comparison of all training stages on a subset of MTEB-rus.

🔧 Technical Details
USER2-base is built on the RuModernBERT-base encoder (149M parameters, 768-dimensional embeddings, 8,192-token context) and trained in three stages: RetroMAE continuous pretraining, weakly supervised fine-tuning, and supervised fine-tuning. MRL training lets the embeddings be truncated to smaller dimensions with minimal quality loss, while cross-lingual transfer and unsupervised pair mining compensate for the scarcity of publicly available Russian paired data.
📄 License
This project is licensed under the Apache-2.0 license.
📖 Citations
```bibtex
@misc{deepvk2025user,
    title={USER2},
    author={Malashenko, Boris and Spirin, Egor and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/USER2-base},
    publisher={Hugging Face},
    year={2025},
}
```