🚀 USER2-base
USER2 is a new generation of the Universal Sentence Encoder for Russian, designed to represent sentences with long-context support of up to 8,192 tokens. The model is built on top of the RuModernBERT encoders and fine-tuned for retrieval and semantic tasks. It also supports Matryoshka Representation Learning (MRL), a technique that enables reducing embedding size with minimal loss in representation quality. This is the base model with 149 million parameters.
🚀 Quick Start
USER2-base encodes Russian sentences into dense embeddings, handles inputs of up to 8,192 tokens, and supports MRL for producing smaller embeddings when needed. To get started, install the libraries listed under "📦 Installation" and run the snippets in the "💻 Usage Examples" section.
✨ Features
- Long-context Support: Handles up to 8,192 tokens, suitable for long texts.
- MRL Support: Reduces embedding size with minimal quality loss.
- Fine-tuned for Retrieval and Semantic Tasks: tuned for retrieval, reranking, classification, clustering, and STS (see the benchmark tables below).
📦 Installation
Install `sentence-transformers` for the high-level API (`pip install sentence-transformers`), or `transformers` and `torch` (`pip install transformers torch`) to use the model directly, as shown in the examples below.
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-base")
# Encode with task-specific prefixes: "search_query: " for queries, "search_document: " for documents.
query_embeddings = model.encode(["Когда был спущен на воду первый миноносец «Спокойный»?"], prompt_name="search_query")
document_embeddings = model.encode(["Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."], prompt_name="search_document")
# Cosine similarity matrix between queries and documents.
similarities = model.similarity(query_embeddings, document_embeddings)
```
Advanced Usage
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions.
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


# With plain transformers, the task prefixes are prepended manually.
queries = ["search_query: Когда был спущен на воду первый миноносец «Спокойный»?"]
documents = ["search_document: Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."]

tokenizer = AutoTokenizer.from_pretrained("deepvk/USER2-base")
model = AutoModel.from_pretrained("deepvk/USER2-base")

encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    queries_outputs = model(**encoded_queries)
    documents_outputs = model(**encoded_documents)

# Mean-pool and L2-normalize so that the dot product equals cosine similarity.
query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)

similarities = query_embeddings @ doc_embeddings.T
```
📚 Documentation
Performance
To evaluate the model, we measure quality on the MTEB-rus benchmark. Additionally, to measure long-context retrieval, we run the Russian subset of the MultiLongDocRetrieval (MLDR) task.
MTEB-rus
| Model | Size | Hidden Dim | Context Length | MRL support | Mean(task) | Mean(taskType) | Classification | Clustering | MultiLabelClassification | PairClassification | Reranking | Retrieval | STS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| USER-base | 124M | 768 | 512 | ❌ | 58.11 | 56.67 | 59.89 | 53.26 | 37.72 | 59.76 | 55.58 | 56.14 | 74.35 |
| USER-bge-m3 | 359M | 1024 | 8192 | ❌ | 62.80 | 62.28 | 61.92 | 53.66 | 36.18 | 65.07 | 68.72 | 73.63 | 76.76 |
| multilingual-e5-base | 278M | 768 | 512 | ❌ | 58.34 | 57.24 | 58.25 | 50.27 | 33.65 | 54.98 | 66.24 | 67.14 | 70.16 |
| multilingual-e5-large-instruct | 560M | 1024 | 512 | ❌ | 65.00 | 63.36 | 66.28 | 63.13 | 41.15 | 63.89 | 64.35 | 68.23 | 76.48 |
| jina-embeddings-v3 | 572M | 1024 | 8192 | ✅ | 63.45 | 60.93 | 65.24 | 60.90 | 39.24 | 59.22 | 53.86 | 71.99 | 76.04 |
| ru-en-RoSBERTa | 404M | 1024 | 512 | ❌ | 61.71 | 60.40 | 62.56 | 56.06 | 38.88 | 60.79 | 63.89 | 66.52 | 74.13 |
| USER2-small | 34M | 384 | 8192 | ✅ | 58.32 | 56.68 | 59.76 | 57.06 | 33.56 | 54.02 | 58.26 | 61.87 | 72.25 |
| USER2-base | 149M | 768 | 8192 | ✅ | 61.12 | 59.59 | 61.67 | 59.22 | 36.61 | 56.39 | 62.06 | 66.90 | 74.28 |
MLDR-rus
| Model | Size | nDCG@10 ↑ |
|---|---|---|
| USER-bge-m3 | 359M | 58.53 |
| KaLM-v1.5 | 494M | 53.75 |
| jina-embeddings-v3 | 572M | 49.67 |
| E5-mistral-7b | 7.11B | 52.40 |
| USER2-small | 34M | 51.69 |
| USER2-base | 149M | 54.17 |
We compare only models with a context length of 8192.
Matryoshka
To evaluate MRL capabilities, we also use MTEB-rus, applying dimensionality cropping to the embeddings to match the selected size.
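In practice, MRL means an embedding can simply be cut to its first N dimensions (and re-normalized if unit-norm vectors are needed). Below is a minimal sketch using the `truncate_dim` argument available in recent `sentence-transformers` releases; the 256-dimension target is an arbitrary choice for illustration, not a recommended setting.

```python
from sentence_transformers import SentenceTransformer

# truncate_dim keeps only the first 256 dimensions of each embedding (MRL-style cropping).
model = SentenceTransformer("deepvk/USER2-base", truncate_dim=256)

embeddings = model.encode(
    ["Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."],
    prompt_name="search_document",
)
print(embeddings.shape)  # (1, 256)
```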

Prefixes
This model is trained similarly to Nomic Embed and expects task-specific prefixes to be added to the input. The choice of prefix depends on the specific task. We follow a few general guidelines when selecting a prefix:
- "classification: " is the default and most universal prefix, often performing well across a variety of tasks.
- "clustering: " is recommended for clustering applications: group texts into clusters, discover shared topics, or remove semantic duplicates.
- "search_query: " and "search_document: " are intended for retrieval and reranking tasks. Also, in some classification tasks, especially with shorter texts, "search_query" shows superior performance to other prefixes. On the other hand, "search_document" can be beneficial for long-context sentence similarity tasks.
However, we encourage users to experiment with different prefixes, as certain domains may benefit from specific ones.
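As a concrete illustration, the snippet below assumes the prefixes above are also registered as prompts in the model's `sentence-transformers` configuration (as `search_query`/`search_document` are in the Basic Usage example); if a prompt name is not registered, prepend the prefix string yourself, as in the second option.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-base")
# Example texts: "Great service, recommend to everyone!" / "Terrible service, I won't come back."
texts = ["Отличный сервис, всем рекомендую!", "Ужасное обслуживание, больше не приду."]

# Option 1: use a registered prompt name (assumed to be defined in the model config).
embeddings = model.encode(texts, prompt_name="classification")

# Option 2: prepend the prefix manually; equivalent and always available.
embeddings = model.encode([f"classification: {t}" for t in texts])
```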
Training details
This is the base version with 149 million parameters, based on RuModernBERT-base. It was fine-tuned in three stages: RetroMAE, Weakly Supervised Fine-Tuning, and Supervised Fine-Tuning.
Following the bge-m3 training strategy, we use RetroMAE as a retrieval-oriented continuous pretraining step. Leveraging data from the final stage of RuModernBERT training, RetroMAE enhances retrieval quality—particularly for long-context inputs.
To follow best practices for building a state-of-the-art encoder, we rely on large-scale training with weakly related text pairs. However, such datasets are not publicly available for Russian, unlike for English or Chinese. To overcome this, we apply two complementary strategies:
- Cross-lingual transfer: We train on both English and Russian data, leveraging English resources (nomic-unsupervised) alongside our in-house English-Russian parallel corpora.
- Unsupervised pair mining: From the deepvk/cultura_ru_edu corpus, we extract 50M pairs using a simple heuristic: selecting non-overlapping text blocks that are not substrings of one another (sketched below).
This approach has shown promising results, allowing us to train high-performing models with minimal target-language pairs—especially when compared to pipelines used for other languages.
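To make the heuristic concrete, the sketch below shows one possible implementation (an illustration of the description above, not the actual mining code; the 256-word block size is an assumed parameter): it draws two non-overlapping word blocks from a document and keeps the pair only if neither block is a substring of the other.

```python
import random


def mine_pair(text: str, block_words: int = 256):
    """Return a weakly related (anchor, positive) pair from one document, or None."""
    words = text.split()
    if len(words) < 2 * block_words:
        return None  # document too short for two non-overlapping blocks

    # Two non-overlapping, contiguous word blocks from the same document.
    first = " ".join(words[:block_words])
    second_start = random.randrange(block_words, len(words) - block_words + 1)
    second = " ".join(words[second_start:second_start + block_words])

    # Discard degenerate pairs where one block is contained in the other.
    if first in second or second in first:
        return None
    return first, second
```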
The table below shows the datasets used and the number of times each was upsampled.
For the third stage, we switch to cleaner, task-specific datasets. In some cases, additional filtering was applied using a cross-encoder. For all retrieval datasets, we mine hard negatives.
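Hard-negative mining can be illustrated as follows: for each query, candidate passages are scored with the encoder and the highest-scoring non-positive passages are kept as negatives. The sketch below is illustrative only; the `top_k` value and the use of the final USER2-base checkpoint are assumptions, not the actual training setup.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-base")


def mine_hard_negatives(query: str, positive: str, corpus: list[str], top_k: int = 5) -> list[str]:
    """Return the top_k passages most similar to the query, excluding the known positive."""
    query_emb = model.encode([query], prompt_name="search_query")
    doc_embs = model.encode(corpus, prompt_name="search_document")
    scores = model.similarity(query_emb, doc_embs)[0]   # cosine similarities, shape (len(corpus),)
    ranked = scores.argsort(descending=True).tolist()   # candidate indices, best first
    return [corpus[i] for i in ranked if corpus[i] != positive][:top_k]
```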
Ablation
Alongside the final model, we also release all intermediate training steps. Both the retromae and weakly_sft models are available under the specified revisions in this repository. We hope these additional models prove useful for your experiments.
Below is a comparison of all training stages on a subset of MTEB-rus.

🔧 Technical Details
USER2-base is built on the RuModernBERT-base encoder (149M parameters, 768-dimensional embeddings, 8,192-token context) and trained in three stages: RetroMAE continuous pretraining, weakly supervised fine-tuning, and supervised fine-tuning. MRL training lets the embeddings be truncated to smaller dimensions with minimal quality loss, while cross-lingual transfer and unsupervised pair mining compensate for the scarcity of publicly available Russian paired data.
📄 License
This project is licensed under the Apache-2.0 license.
📖 Citations
```bibtex
@misc{deepvk2025user,
    title={USER2},
    author={Malashenko, Boris and Spirin, Egor and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/USER2-base},
    publisher={Hugging Face},
    year={2025},
}
```