🚀 USER-base
The Universal Sentence Encoder for Russian (USER) is a sentence-transformer model designed specifically for extracting embeddings for the Russian language. It maps sentences and paragraphs into a 768-dimensional dense vector space that can be used for tasks like clustering or semantic search. The model is initialized from deepvk/deberta-v1-base and is trained to work solely with Russian; its performance on other languages has not been evaluated.
🚀 Quick Start
Using this model is straightforward once you have sentence-transformers installed:

```bash
pip install -U sentence-transformers
```

Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer

queries = [
    "Когда был спущен на воду первый миноносец «Спокойный»?",
    "Есть ли нефть в Удмуртии?",
]
passages = [
    "Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года.",
    "Нефтепоисковые работы в Удмуртии были начаты сразу после Второй мировой войны в 1945 году и продолжаются по сей день. Добыча нефти началась в 1967 году.",
]

model = SentenceTransformer("deepvk/USER-base")

# The prompt should be specified according to the task (either 'query' or 'passage').
passage_embeddings = model.encode(passages, normalize_embeddings=True, prompt_name="passage")

# For tasks other than retrieval, you can simply use the `query` prompt, which is set by default.
query_embeddings = model.encode(queries, normalize_embeddings=True)
```
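Since the embeddings are L2-normalized, a plain dot product gives cosine similarity; a minimal sketch continuing from the variables above:

```python
# Cosine similarity of every query against every passage
# (the encode() calls above return numpy arrays).
scores = query_embeddings @ passage_embeddings.T
print(scores)  # scores[i][j] is the similarity of query i and passage j
```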
Alternatively, you can use the model directly with transformers:
```python
import torch.nn.functional as F
from torch import Tensor, inference_mode
from transformers import AutoTokenizer, AutoModel


def average_pool(
    last_hidden_states: Tensor,
    attention_mask: Tensor,
) -> Tensor:
    # Zero out padding positions, then average over the sequence dimension.
    last_hidden = last_hidden_states.masked_fill(
        ~attention_mask[..., None].bool(), 0.0
    )
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# You should add prompts manually when using the model directly: each input text
# should start with "query: " or "passage: ". For tasks other than retrieval,
# you can simply use the "query: " prefix.
input_texts = [
    "query: Когда был спущен на воду первый миноносец «Спокойный»?",
    "query: Есть ли нефть в Удмуртии?",
    "passage: Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года.",
    "passage: Нефтепоисковые работы в Удмуртии были начаты сразу после Второй мировой войны в 1945 году и продолжаются по сей день. Добыча нефти началась в 1967 году.",
]

tokenizer = AutoTokenizer.from_pretrained("deepvk/USER-base")
model = AutoModel.from_pretrained("deepvk/USER-base")

batch_dict = tokenizer(
    input_texts, padding=True, truncation=True, return_tensors="pt"
)

with inference_mode():
    outputs = model(**batch_dict)
    embeddings = average_pool(
        outputs.last_hidden_state, batch_dict["attention_mask"]
    )
    embeddings = F.normalize(embeddings, p=2, dim=1)

# Scores for each (query, passage) pair
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.round(decimals=2))
# [[55.86, 30.95],
#  [22.82, 59.46]]
```
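To turn the score matrix into retrieval results, take the argmax per query; a minimal follow-up sketch:

```python
# Index of the best-matching passage for each query.
best = scores.argmax(dim=1)
print(best)  # tensor([0, 1]): each query matches its corresponding passage
```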
⚠️ Important Note
Each input text should start with "query: " or "passage: ". For tasks other than retrieval, you can simply use the "query: " prefix.
✨ Features
- Russian-specific: Designed exclusively for the Russian language, mapping sentences and paragraphs to a 768-dimensional dense vector space.
- Versatile applications: Can be used for tasks like clustering or semantic search.
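For example, the embeddings can be fed straight into standard clustering tools. Below is a minimal sketch using scikit-learn's KMeans; the sentences and the cluster count are illustrative assumptions, not part of the model card:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("deepvk/USER-base")

# Illustrative sentences: two about the weather, two about food.
sentences = [
    "Сегодня на улице идёт сильный дождь.",   # "It is raining hard outside today."
    "Прогноз обещает снег и метель.",          # "The forecast promises snow and a blizzard."
    "Борщ лучше подавать со сметаной.",        # "Borscht is best served with sour cream."
    "Пельмени варятся около десяти минут.",    # "Pelmeni take about ten minutes to boil."
]
# The default `query` prompt is appropriate here (see the FAQ).
embeddings = model.encode(sentences, normalize_embeddings=True)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
print(kmeans.labels_)  # e.g. [0 0 1 1]: weather vs. food sentences
```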
🔧 Technical Details
Training Strategy
We aimed to follow the bge-base-en training algorithm, but made several improvements along the way:

- Initialization: the model is initialized from deepvk/deberta-v1-base.
- First stage: contrastive pre-training with weak supervision on the Russian part of the mMarco corpus.
- Second stage: supervised fine-tuning of two different models based on data symmetry, followed by merging via LM-Cocktail.
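The contrastive objective in this kind of bge-style pre-training is typically an in-batch InfoNCE loss over (query, passage) pairs. The exact loss and temperature used for USER are not specified here, so the sketch below is only illustrative:

```python
import torch
import torch.nn.functional as F


def info_nce(query_emb: torch.Tensor, passage_emb: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE: the i-th passage is the positive for the i-th query,
    and all other passages in the batch serve as negatives.
    The temperature is a common default, not the value used for USER-base."""
    query_emb = F.normalize(query_emb, dim=1)
    passage_emb = F.normalize(passage_emb, dim=1)
    logits = query_emb @ passage_emb.T / temperature  # (batch, batch) similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

The LM-Cocktail step, roughly speaking, then merges the two fine-tuned models by taking a weighted average of their parameters.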
Dataset
During model development, we additionally collected two datasets: deepvk/ru-HNP and deepvk/ru-WANLI.
| Property | Details |
|---|---|
| Symmetric Datasets | AllNLI (282,644), MedNLI (3,699), RCB (392), Terra (1,359), Tapaco (91,240), Opus100 (1,000,000), BiblePar (62,195), deepvk/ru-WANLI (35,455), deepvk/ru-HNP (500,000) |
| Asymmetric Datasets | MIRACL (10,000), MLDR (1,864), Lenta (185,972), Mlsum (51,112), Mr-TyDi (536,600), Panorama (11,024), PravoIsrael (26,364), Xlsum (124,486), Fialka-v1 (130,000), RussianKeywords (16,461), Gazeta (121,928), Gsm8k-ru (7,470), DSumRu (27,191), SummDialogNews (75,700) |
| Total positive pairs | 3,352,653 |
| Total negative pairs | 792,644 (negative pairs from AllNLI, MIRACL, deepvk/ru-WANLI, deepvk/ru-HNP) |
For all labeled datasets, we use only the training split for fine-tuning. For the Gazeta, Mlsum, and Xlsum datasets, (title, text) and (title, summary) pairs are combined and used as asymmetric data. AllNLI is a Russian translation of the combination of SNLI, MNLI, and ANLI.
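The symmetry of a dataset plausibly determines how its pairs are prefixed during fine-tuning, matching the prompt rules from the FAQ below. The following is a sketch of that convention, not the actual training code, and `render_pair` is a hypothetical helper:

```python
def render_pair(text_a: str, text_b: str, symmetric: bool) -> tuple[str, str]:
    """Hypothetical helper: attach e5-style prefixes to a training pair.
    Symmetric pairs (e.g. NLI, paraphrases) get "query: " on both sides;
    asymmetric pairs (e.g. title/text) get "query: " / "passage: "."""
    if symmetric:
        return f"query: {text_a}", f"query: {text_b}"
    return f"query: {text_a}", f"passage: {text_b}"


# A (title, text) pair from a summarization-style dataset is asymmetric:
print(render_pair("Заголовок статьи", "Полный текст статьи...", symmetric=False))
```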
📚 Documentation
Experiments
As a baseline, we chose the current top models from the encodechka
leaderboard table. In addition, we evaluated the model on the Russian subset of MTEB
, which includes 10 tasks. Unfortunately, we could not validate the bge-m3 on some MTEB tasks, specifically clustering, due to excessive computational resources. Besides these two benchmarks, we also evaluated the models on the MIRACL
. All experiments were conducted using an NVIDIA TESLA A100 40 GB GPU. We used validation scripts from the official repositories for each of the tasks.
| Model | Size, M params (w/o embeddings) | Encodechka (Mean S) | MTEB (Mean Ru) | MIRACL (Recall@100) |
|---|---|---|---|---|
| bge-m3 | 303 | 0.786 | 0.694 | 0.959 |
| multilingual-e5-large | 303 | 0.78 | 0.665 | 0.927 |
| USER (this model) | 85 | 0.772 | 0.666 | 0.763 |
| paraphrase-multilingual-mpnet-base-v2 | 85 | 0.76 | 0.625 | 0.149 |
| multilingual-e5-base | 85 | 0.756 | 0.645 | 0.915 |
| LaBSE-en-ru | 85 | 0.74 | 0.599 | 0.327 |
| sn-xlm-roberta-base-snli-mnli-anli-xnli | 85 | 0.74 | 0.593 | 0.08 |
Our solution outperforms all other models of the same size on both Encodechka and MTEB. Since the model still slightly underperforms existing solutions on retrieval tasks, we aim to address this in future research.
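For reference, the Recall@100 reported for MIRACL measures the fraction of relevant passages retrieved among the top 100 results. A simplified sketch of the metric (the reported numbers come from the official validation scripts):

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 100) -> float:
    """Fraction of the relevant documents that appear among the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Example: 1 of 2 relevant docs retrieved in the top-100 -> recall 0.5
print(recall_at_k(["d1", "d7", "d3"], {"d7", "d9"}, k=100))
```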
📄 License
This project is licensed under the Apache 2.0 license.
📚 FAQ
Do I need to add the prefixes "query: " and "passage: " to input texts?

Yes, this is how the model was trained; otherwise you will see a performance degradation. Here are some rules of thumb:

- Use `"query: "` and `"passage: "` correspondingly for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
- Use the `"query: "` prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
- Use the `"query: "` prefix if you want to use embeddings as features, such as for linear-probing classification or clustering.
📖 Citations
```bibtex
@misc{deepvk2024user,
    title={USER: Universal Sentence Encoder for Russian},
    author={Malashenko, Boris and Zemerov, Anton and Spirin, Egor},
    url={https://huggingface.co/datasets/deepvk/USER-base},
    publisher={Hugging Face},
    year={2024},
}
```