🚀 USER2-small
USER2 is a new-generation Universal Sentence Encoder for Russian. It's designed for sentence representation and supports long contexts of up to 8,192 tokens.
The models are built on top of the RuModernBERT encoders and fine-tuned for retrieval and semantic tasks. They also support Matryoshka Representation Learning (MRL), a technique that can reduce embedding size with minimal loss in representation quality.
This is a small model with 34 million parameters.
🚀 Quick Start
Model Information
| Property | Details |
|----------|---------|
| Model Type | Sentence Transformer |
| Base Model | deepvk/RuModernBERT-small |
| Training Datasets | nomic-en, nomic-ru, in-house En-Ru parallel, cultura-sampled, etc. |
| License | apache-2.0 |
Model Comparison
| Model | Size | Context Length | Hidden Dim | MRL Dims |
|-------|------|----------------|------------|----------|
| deepvk/USER2-small | 34M | 8192 | 384 | [32, 64, 128, 256, 384] |
| deepvk/USER2-base | 149M | 8192 | 768 | [32, 64, 128, 256, 384, 512, 768] |
✨ Features
- Long-context Support: Capable of handling contexts up to 8,192 tokens.
- Matryoshka Representation Learning (MRL): Allows for dimensionality reduction of embeddings with minimal quality loss.
- Task-specific Prefixes: Supports task-specific prefixes for better performance in different tasks.
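As a quick illustration of the long-context support, the sketch below encodes a document well beyond a typical 512-token window. The repeated sentence and the explicit `max_seq_length` assignment are illustrative assumptions; the released configuration may already set the 8,192-token limit.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-small")
model.max_seq_length = 8192  # illustrative; the shipped config may already use this limit

# A long document built by repetition, purely for demonstration.
long_document = "search_document: " + "Зачислен в списки ВМФ СССР 19 августа 1952 года. " * 400
embedding = model.encode([long_document])
print(embedding.shape)  # (1, 384)
```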
💻 Usage Examples
Prefixes
This model is trained similarly to Nomic Embed and requires task-specific prefixes to be added to the input. The choice of prefix depends on the specific task. Here are some general guidelines:
- "classification: " is the default and most universal prefix, often performing well across various tasks.
- "clustering: " is recommended for clustering applications, such as grouping texts into clusters, discovering shared topics, or removing semantic duplicates.
- "search_query: " and "search_document: " are intended for retrieval and reranking tasks. In some classification tasks, especially with shorter texts, "search_query" shows better performance than other prefixes. On the other hand, "search_document" can be beneficial for long-context sentence similarity tasks.
Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-small")

query_embeddings = model.encode(["Когда был спущен на воду первый миноносец «Спокойный»?"], prompt_name="search_query")
document_embeddings = model.encode(["Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."], prompt_name="search_document")

similarities = model.similarity(query_embeddings, document_embeddings)
```
To truncate the embedding dimension, simply pass the new value to the model initialization:
```python
model = SentenceTransformer("deepvk/USER2-small", truncate_dim=128)
```
This model was trained with dimensions [32, 64, 128, 256, 384], so it's recommended to use one of these for best performance.
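For non-retrieval tasks, the prefixes from the guidelines above can simply be prepended to the input text instead of using `prompt_name`. A minimal sketch for a clustering-style use, with illustrative example sentences:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-small")

# Prepend the task prefix directly to each input string.
sentences = [
    "clustering: Спокойный (эсминец) зачислен в списки ВМФ СССР 19 августа 1952 года.",
    "clustering: Эсминец «Спокойный» был включён в состав ВМФ СССР в августе 1952 года.",
    "clustering: Балтийское море расположено в Северной Европе.",
]
embeddings = model.encode(sentences)
print(model.similarity(embeddings, embeddings))
```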
Transformers
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


queries = ["search_query: Когда был спущен на воду первый миноносец «Спокойный»?"]
documents = ["search_document: Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."]

tokenizer = AutoTokenizer.from_pretrained("deepvk/USER2-small")
model = AutoModel.from_pretrained("deepvk/USER2-small")

encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    queries_outputs = model(**encoded_queries)
    documents_outputs = model(**encoded_documents)

query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)

similarities = query_embeddings @ doc_embeddings.T
```
To truncate the embedding dimension, take only the first `truncate_dim` values and re-normalize:
```python
truncate_dim = 128  # one of the MRL dimensions listed above

query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = query_embeddings[:, :truncate_dim]
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
```
Note that L2 normalization is applied after truncation, so the cropped embeddings remain unit-length for cosine similarity.
📚 Documentation
Performance
To evaluate the model, we measure quality on the MTEB-rus benchmark. Additionally, to measure long-context retrieval, we run the Russian subset of the MultiLongDocRetrieval (MLDR) task.
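For reference, here is a rough sketch of how such an evaluation could be reproduced with the mteb package. The language-based task selection is only an approximation of the MTEB-rus task list, and behaviour may vary between mteb versions, so treat this as an assumption-laden sketch rather than the exact setup behind the numbers below.
```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-small")

# Approximate MTEB-rus by filtering tasks by language; the exact task list may differ.
tasks = mteb.get_tasks(languages=["rus"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/USER2-small")
```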
MTEB-rus
| Model | Size | Hidden Dim | Context Length | MRL support | Mean(task) | Mean(taskType) | Classification | Clustering | MultiLabelClassification | PairClassification | Reranking | Retrieval | STS |
|-------|------|-----------|----------------|-------------|------------|----------------|----------------|------------|--------------------------|--------------------|-----------|-----------|-----|
| USER-base | 124M | 768 | 512 | ❌ | 58.11 | 56.67 | 59.89 | 53.26 | 37.72 | 59.76 | 55.58 | 56.14 | 74.35 |
| USER-bge-m3 | 359M | 1024 | 8192 | ❌ | 62.80 | 62.28 | 61.92 | 53.66 | 36.18 | 65.07 | 68.72 | 73.63 | 76.76 |
| multilingual-e5-base | 278M | 768 | 512 | ❌ | 58.34 | 57.24 | 58.25 | 50.27 | 33.65 | 54.98 | 66.24 | 67.14 | 70.16 |
| multilingual-e5-large-instruct | 560M | 1024 | 512 | ❌ | 65.00 | 63.36 | 66.28 | 63.13 | 41.15 | 63.89 | 64.35 | 68.23 | 76.48 |
| jina-embeddings-v3 | 572M | 1024 | 8192 | ✅ | 63.45 | 60.93 | 65.24 | 60.90 | 39.24 | 59.22 | 53.86 | 71.99 | 76.04 |
| ru-en-RoSBERTa | 404M | 1024 | 512 | ❌ | 61.71 | 60.40 | 62.56 | 56.06 | 38.88 | 60.79 | 63.89 | 66.52 | 74.13 |
| USER2-small | 34M | 384 | 8192 | ✅ | 58.32 | 56.68 | 59.76 | 57.06 | 33.56 | 54.02 | 58.26 | 61.87 | 72.25 |
| USER2-base | 149M | 768 | 8192 | ✅ | 61.12 | 59.59 | 61.67 | 59.22 | 36.61 | 56.39 | 62.06 | 66.90 | 74.28 |
MLDR-rus
| Model | Size | nDCG@10 ↑ |
|-------|------|-----------|
| USER-bge-m3 | 359M | 58.53 |
| KaLM-v1.5 | 494M | 53.75 |
| jina-embeddings-v3 | 572M | 49.67 |
| E5-mistral-7b | 7.11B | 52.40 |
| USER2-small | 34M | 51.69 |
| USER2-base | 149M | 54.17 |
We compare only models with a context length of 8192.
Matryoshka
To evaluate MRL capabilities, we also use MTEB-rus, applying dimensionality cropping to the embeddings to match the selected size.
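A minimal sketch of this cropping procedure: slice the embeddings to the target dimension, then re-normalize before computing cosine similarity. The example sentences are illustrative.
```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-small")
sentences = [
    "classification: Спокойный (эсминец) зачислен в списки ВМФ СССР 19 августа 1952 года.",
    "classification: Эсминец «Спокойный» был включён в состав ВМФ СССР в августе 1952 года.",
]
full = model.encode(sentences, convert_to_tensor=True)

# Crop to each trained MRL dimension and re-normalize before scoring.
for dim in [32, 64, 128, 256, 384]:
    cropped = F.normalize(full[:, :dim], p=2, dim=1)
    print(dim, round((cropped[0] @ cropped[1]).item(), 4))
```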

🔧 Technical Details
Training details
This is the small version with 34 million parameters, based on RuModernBERT-small. It was fine-tuned in three stages: RetroMAE, Weakly Supervised Fine-Tuning, and Supervised Fine-Tuning.
Following the bge-m3 training strategy, we use RetroMAE as a retrieval-oriented continuous pretraining step. Leveraging data from the final stage of RuModernBERT training, RetroMAE enhances retrieval quality, especially for long-context inputs.
To follow best practices for building a state-of-the-art encoder, we rely on large-scale training with weakly related text pairs. However, such datasets are not publicly available for Russian, unlike for English or Chinese. To overcome this, we apply two complementary strategies:
- Cross-lingual transfer: We train on both English and Russian data, leveraging English resources (nomic-unsupervised) alongside our in-house English-Russian parallel corpora.
- Unsupervised pair mining: From the deepvk/cultura_ru_edu corpus, we extract 50M pairs using a simple heuristic: selecting non-overlapping text blocks that are not substrings of one another (a sketch of this heuristic follows the list).
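A rough, hypothetical reconstruction of such a heuristic is shown below; the block size and the pairing of adjacent blocks are illustrative choices, not the exact production pipeline.
```python
def mine_pairs(document: str, block_size: int = 512) -> list[tuple[str, str]]:
    """Pair adjacent, non-overlapping text blocks of a document, keeping only
    pairs where neither block is a substring of the other (illustrative sketch)."""
    blocks = [document[i : i + block_size] for i in range(0, len(document), block_size)]
    return [
        (left, right)
        for left, right in zip(blocks, blocks[1:])
        if left not in right and right not in left
    ]
```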
This approach has shown promising results, allowing us to train high-performing models with minimal target-language pairs, especially when compared to pipelines used for other languages.
The table below shows the datasets used and the number of times each was upsampled.
For the third stage, we switch to cleaner, task-specific datasets. In some cases, additional filtering was applied using a cross-encoder. For all retrieval datasets, we mine hard negatives.
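A simplified sketch of one common way to mine hard negatives with the model itself: retrieve the top-scoring passages for each query and drop the gold positives. The toy data and top_k value are illustrative, not the pipeline used for training.
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("deepvk/USER2-small")

queries = ["search_query: Когда был спущен на воду первый миноносец «Спокойный»?"]
corpus = [
    "search_document: Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года.",
    "search_document: Балтийское море расположено в Северной Европе.",
    "search_document: Список эсминцев ВМФ СССР включает корабли разных проектов.",
]
positives = {0: {0}}  # query index -> indices of gold passages

query_emb = model.encode(queries, convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# The highest-scoring non-gold passages are kept as hard negatives.
for qi, hits in enumerate(util.semantic_search(query_emb, corpus_emb, top_k=3)):
    hard_negatives = [h["corpus_id"] for h in hits if h["corpus_id"] not in positives[qi]]
    print(qi, hard_negatives)
```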
Ablation
Alongside the final model, we also release all intermediate training steps. Both the retromae and weakly_sft models are available under the specified revisions in this repository. We hope these additional models prove useful for your experiments.
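The intermediate checkpoints can be loaded by passing a revision. The revision names below follow the stage names mentioned above and are an assumption; verify them against the repository's listed revisions.
```python
from transformers import AutoModel, AutoTokenizer

# Assumed revision names matching the stage names above; check the repo if they differ.
tokenizer = AutoTokenizer.from_pretrained("deepvk/USER2-small", revision="weakly_sft")
model = AutoModel.from_pretrained("deepvk/USER2-small", revision="weakly_sft")
```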
Below is a comparison of all training stages on a subset of MTEB-rus.

📄 License
This project is licensed under the Apache-2.0 license.
📖 Citations
```
@misc{deepvk2025user,
    title={USER2},
    author={Malashenko, Boris and Spirin, Egor and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/USER2-small},
    publisher={Hugging Face},
    year={2025},
}
```