🚀 USER-base
The Universal Sentence Encoder for Russian (USER) is a sentence-transformer model designed specifically for extracting embeddings for the Russian language. It maps sentences and paragraphs into a 768-dimensional dense vector space that can be used for tasks like clustering or semantic search. The model is initialized from deepvk/deberta-v1-base and is trained to work solely with Russian; its performance on other languages has not been evaluated.
🚀 Quick Start
Using this model is straightforward once you have sentence-transformers installed:

```bash
pip install -U sentence-transformers
```

Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer

queries = [
    "Когда был спущен на воду первый миноносец «Спокойный»?",
    "Есть ли нефть в Удмуртии?",
]
passages = [
    "Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года.",
    "Нефтепоисковые работы в Удмуртии были начаты сразу после Второй мировой войны в 1945 году и продолжаются по сей день. Добыча нефти началась в 1967 году.",
]

model = SentenceTransformer("deepvk/USER-base")

# The prompt should be specified according to the task (either 'query' or 'passage').
passage_embeddings = model.encode(passages, normalize_embeddings=True, prompt_name="passage")

# For tasks other than retrieval, you can simply use the `query` prompt, which is set by default.
query_embeddings = model.encode(queries, normalize_embeddings=True)
```
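Since the embeddings are L2-normalized, a plain dot product gives cosine similarity; a minimal sketch continuing from the variables above:

```python
# Cosine similarity of every query against every passage
# (the encode() calls above return numpy arrays).
scores = query_embeddings @ passage_embeddings.T
print(scores)  # scores[i][j] is the similarity of query i and passage j
```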
Alternatively, you can use the model directly with transformers:
```python
import torch.nn.functional as F
from torch import Tensor, inference_mode
from transformers import AutoTokenizer, AutoModel


def average_pool(
    last_hidden_states: Tensor,
    attention_mask: Tensor,
) -> Tensor:
    # Zero out padding positions, then average over the sequence dimension.
    last_hidden = last_hidden_states.masked_fill(
        ~attention_mask[..., None].bool(), 0.0
    )
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# You should add prompts manually when using the model directly: each input text
# should start with "query: " or "passage: ". For tasks other than retrieval,
# you can simply use the "query: " prefix.
input_texts = [
    "query: Когда был спущен на воду первый миноносец «Спокойный»?",
    "query: Есть ли нефть в Удмуртии?",
    "passage: Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года.",
    "passage: Нефтепоисковые работы в Удмуртии были начаты сразу после Второй мировой войны в 1945 году и продолжаются по сей день. Добыча нефти началась в 1967 году.",
]

tokenizer = AutoTokenizer.from_pretrained("deepvk/USER-base")
model = AutoModel.from_pretrained("deepvk/USER-base")

batch_dict = tokenizer(
    input_texts, padding=True, truncation=True, return_tensors="pt"
)

with inference_mode():
    outputs = model(**batch_dict)
    embeddings = average_pool(
        outputs.last_hidden_state, batch_dict["attention_mask"]
    )
    embeddings = F.normalize(embeddings, p=2, dim=1)

# Scores for each (query, passage) pair
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.round(decimals=2))
# [[55.86, 30.95],
#  [22.82, 59.46]]
```
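To turn the score matrix into retrieval results, take the argmax per query; a minimal follow-up sketch:

```python
# Index of the best-matching passage for each query.
best = scores.argmax(dim=1)
print(best)  # tensor([0, 1]): each query matches its corresponding passage
```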
⚠️ Important Note
Each input text should start with "query: " or "passage: ". For tasks other than retrieval, you can simply use the "query: " prefix.
✨ Features
- Russian-specific: Designed exclusively for the Russian language, mapping sentences and paragraphs to a 768-dimensional dense vector space.
- Versatile applications: Can be used for tasks like clustering or semantic search.
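For example, the embeddings can be fed straight into standard clustering tools. Below is a minimal sketch using scikit-learn's KMeans; the sentences and the cluster count are illustrative assumptions, not part of the model card:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("deepvk/USER-base")

# Illustrative sentences: two about the weather, two about food.
sentences = [
    "Сегодня на улице идёт сильный дождь.",   # "It is raining hard outside today."
    "Прогноз обещает снег и метель.",          # "The forecast promises snow and a blizzard."
    "Борщ лучше подавать со сметаной.",        # "Borscht is best served with sour cream."
    "Пельмени варятся около десяти минут.",    # "Pelmeni take about ten minutes to boil."
]
# The default `query` prompt is appropriate here (see the FAQ).
embeddings = model.encode(sentences, normalize_embeddings=True)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
print(kmeans.labels_)  # e.g. [0 0 1 1]: weather vs. food sentences
```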
🔧 Technical Details
Training Strategy
We aimed to follow the bge-base-en training algorithm, but made several improvements along the way:

- Initialization: the model is initialized from deepvk/deberta-v1-base.
- First stage: contrastive pre-training with weak supervision on the Russian part of the mMarco corpus.
- Second stage: supervised fine-tuning of two different models based on data symmetry, followed by merging via LM-Cocktail.
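The contrastive objective in this kind of bge-style pre-training is typically an in-batch InfoNCE loss over (query, passage) pairs. The exact loss and temperature used for USER are not specified here, so the sketch below is only illustrative:

```python
import torch
import torch.nn.functional as F


def info_nce(query_emb: torch.Tensor, passage_emb: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE: the i-th passage is the positive for the i-th query,
    and all other passages in the batch serve as negatives.
    The temperature is a common default, not the value used for USER-base."""
    query_emb = F.normalize(query_emb, dim=1)
    passage_emb = F.normalize(passage_emb, dim=1)
    logits = query_emb @ passage_emb.T / temperature  # (batch, batch) similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

The LM-Cocktail step, roughly speaking, then merges the two fine-tuned models by taking a weighted average of their parameters.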
Dataset
During model development, we additionally collected two datasets: deepvk/ru-HNP and deepvk/ru-WANLI.
| Property | Details |
|---|---|
| Symmetric Datasets | AllNLI (282,644), MedNLI (3,699), RCB (392), Terra (1,359), Tapaco (91,240), Opus100 (1,000,000), BiblePar (62,195), deepvk/ru-WANLI (35,455), deepvk/ru-HNP (500,000) |
| Asymmetric Datasets | MIRACL (10,000), MLDR (1,864), Lenta (185,972), Mlsum (51,112), Mr-TyDi (536,600), Panorama (11,024), PravoIsrael (26,364), Xlsum (124,486), Fialka-v1 (130,000), RussianKeywords (16,461), Gazeta (121,928), Gsm8k-ru (7,470), DSumRu (27,191), SummDialogNews (75,700) |
| Total positive pairs | 3,352,653 |
| Total negative pairs | 792,644 (negative pairs from AllNLI, MIRACL, deepvk/ru-WANLI, deepvk/ru-HNP) |
For all labeled datasets, we use only the training split for fine-tuning. For the Gazeta, Mlsum, and Xlsum datasets, (title, text) and (title, summary) pairs are combined and used as asymmetric data. AllNLI is a Russian translation of the combination of SNLI, MNLI, and ANLI.
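The symmetry of a dataset plausibly determines how its pairs are prefixed during fine-tuning, matching the prompt rules from the FAQ below. The following is a sketch of that convention, not the actual training code, and `render_pair` is a hypothetical helper:

```python
def render_pair(text_a: str, text_b: str, symmetric: bool) -> tuple[str, str]:
    """Hypothetical helper: attach e5-style prefixes to a training pair.
    Symmetric pairs (e.g. NLI, paraphrases) get "query: " on both sides;
    asymmetric pairs (e.g. title/text) get "query: " / "passage: "."""
    if symmetric:
        return f"query: {text_a}", f"query: {text_b}"
    return f"query: {text_a}", f"passage: {text_b}"


# A (title, text) pair from a summarization-style dataset is asymmetric:
print(render_pair("Заголовок статьи", "Полный текст статьи...", symmetric=False))
```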
📚 Documentation
Experiments
As a baseline, we chose the current top models from the encodechka
leaderboard table. In addition, we evaluated the model on the Russian subset of MTEB
, which includes 10 tasks. Unfortunately, we could not validate the bge-m3 on some MTEB tasks, specifically clustering, due to excessive computational resources. Besides these two benchmarks, we also evaluated the models on the MIRACL
. All experiments were conducted using an NVIDIA TESLA A100 40 GB GPU. We used validation scripts from the official repositories for each of the tasks.
| Model | Size, M params (w/o embeddings) | Encodechka (Mean S) | MTEB (Mean Ru) | MIRACL (Recall@100) |
|---|---|---|---|---|
| bge-m3 | 303 | 0.786 | 0.694 | 0.959 |
| multilingual-e5-large | 303 | 0.78 | 0.665 | 0.927 |
| USER (this model) | 85 | 0.772 | 0.666 | 0.763 |
| paraphrase-multilingual-mpnet-base-v2 | 85 | 0.76 | 0.625 | 0.149 |
| multilingual-e5-base | 85 | 0.756 | 0.645 | 0.915 |
| LaBSE-en-ru | 85 | 0.74 | 0.599 | 0.327 |
| sn-xlm-roberta-base-snli-mnli-anli-xnli | 85 | 0.74 | 0.593 | 0.08 |
Our solution outperforms all other models of the same size on both Encodechka and MTEB. Since the model still slightly underperforms existing solutions on retrieval tasks, we aim to address this in future research.
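For reference, the Recall@100 reported for MIRACL measures the fraction of relevant passages retrieved among the top 100 results. A simplified sketch of the metric (the reported numbers come from the official validation scripts):

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 100) -> float:
    """Fraction of the relevant documents that appear among the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Example: 1 of 2 relevant docs retrieved in the top-100 -> recall 0.5
print(recall_at_k(["d1", "d7", "d3"], {"d7", "d9"}, k=100))
```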
📄 License
This project is licensed under the Apache 2.0 license.
📚 FAQ
Do I need to add the prefixes "query: " and "passage: " to input texts?

Yes, this is how the model was trained; otherwise you will see a performance degradation. Here are some rules of thumb:

- Use `"query: "` and `"passage: "` correspondingly for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
- Use the `"query: "` prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
- Use the `"query: "` prefix if you want to use embeddings as features, such as for linear-probing classification or clustering.
📖 Citations
```bibtex
@misc{deepvk2024user,
    title={USER: Universal Sentence Encoder for Russian},
    author={Malashenko, Boris and Zemerov, Anton and Spirin, Egor},
    url={https://huggingface.co/datasets/deepvk/USER-base},
    publisher={Hugging Face},
    year={2024},
}
```