🚀 USER-bge-m3
The Universal Sentence Encoder for Russian (USER) is a sentence-transformer model designed specifically for extracting embeddings in the Russian language. It maps sentences and paragraphs to a 1024-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.
This model is initialized from TatonkaHF/bge-m3_en_ru, a shrunk version of the baai/bge-m3 model, and is trained primarily for the Russian language. Its performance on other languages has not been evaluated.
✨ Features
- Russian-specific: Tailored to work effectively with the Russian language.
- Dense vector representation: Maps text to a 1024-dimensional dense vector space.
- Versatile applications: Suitable for clustering and semantic search tasks.
📦 Installation
Using this model is easy once you have sentence-transformers installed:
```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

input_texts = [
    "Когда был спущен на воду первый миноносец «Спокойный»?",
    "Есть ли нефть в Удмуртии?",
    "Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года.",
    "Нефтепоисковые работы в Удмуртии были начаты сразу после Второй мировой войны в 1945 году и продолжаются по сей день. Добыча нефти началась в 1967 году."
]

model = SentenceTransformer("deepvk/USER-bge-m3")
embeddings = model.encode(input_texts, normalize_embeddings=True)
```
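Since the embeddings are L2-normalized, relevance between the two queries and the two passages above can be scored with a plain dot product, mirroring the final step of the advanced example below:

```python
# Dot product of normalized embeddings equals cosine similarity
scores = embeddings[:2] @ embeddings[2:].T
print(scores)
```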
Advanced Usage
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

input_texts = [
    "Когда был спущен на воду первый миноносец «Спокойный»?",
    "Есть ли нефть в Удмуртии?",
    "Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года.",
    "Нефтепоисковые работы в Удмуртии были начаты сразу после Второй мировой войны в 1945 году и продолжаются по сей день. Добыча нефти началась в 1967 году."
]

tokenizer = AutoTokenizer.from_pretrained("deepvk/USER-bge-m3")
model = AutoModel.from_pretrained("deepvk/USER-bge-m3")
model.eval()

encoded_input = tokenizer(input_texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    model_output = model(**encoded_input)
    # Use the [CLS] token representation as the sentence embedding
    sentence_embeddings = model_output[0][:, 0]
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

# Similarity between the two queries and the two passages
scores = sentence_embeddings[:2] @ sentence_embeddings[2:].T
```
You can also use the native FlagEmbedding library for evaluation; its usage is described in the bge-m3 model card.
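For example, a minimal sketch with FlagEmbedding's BGEM3FlagModel, assuming the standard bge-m3 loading path also works for this checkpoint:

```python
from FlagEmbedding import BGEM3FlagModel

# Assumption: the checkpoint loads the same way as BAAI/bge-m3
model = BGEM3FlagModel("deepvk/USER-bge-m3", use_fp16=True)

input_texts = [
    "Когда был спущен на воду первый миноносец «Спокойный»?",
    "Есть ли нефть в Удмуртии?",
]

# Dense embeddings are returned under the "dense_vecs" key
embeddings = model.encode(input_texts)["dense_vecs"]
```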
📚 Documentation
Training Details
We follow the training algorithm of the USER-base model, with several modifications due to the use of a different backbone.

Initialization: TatonkaHF/bge-m3_en_ru – a shrunk version of baai/bge-m3 that supports only Russian and English tokens.
Fine-tuning: Supervised fine-tuning of two different models based on data symmetry, followed by merging via LM-Cocktail:
- Since we split the data, we can additionally apply the AnglE loss to the symmetric model, which improves performance on symmetric tasks.
- Finally, we add the original bge-m3 model to the two obtained models to prevent catastrophic forgetting, and tune the weights for the merger using LM-Cocktail to produce the final model, USER-bge-m3, as sketched below.
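The merging step amounts to a weighted combination of model parameters. A minimal illustrative sketch of that idea in plain PyTorch follows; this is not the LM-Cocktail implementation, and the checkpoint names and mixing weights are placeholders:

```python
from transformers import AutoModel

# Placeholder checkpoint names and mixing weights, for illustration only
checkpoints = ["path/to/symmetric-model", "path/to/asymmetric-model", "baai/bge-m3"]
weights = [0.4, 0.4, 0.2]

models = [AutoModel.from_pretrained(name) for name in checkpoints]
state_dicts = [m.state_dict() for m in models]

# Weighted average of floating-point parameters; non-float buffers are copied as-is
merged_state = {}
for key, ref in state_dicts[0].items():
    if ref.dtype.is_floating_point:
        merged_state[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    else:
        merged_state[key] = ref

merged_model = AutoModel.from_pretrained(checkpoints[0])
merged_model.load_state_dict(merged_state)
```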
Dataset
During model development, we additionally collect two datasets: deepvk/ru-HNP and deepvk/ru-WANLI.
- Total positive pairs: 2,240,961
- Total negative pairs: 792,644 (negative pairs from AllNLI, MIRACL, deepvk/ru-WANLI, deepvk/ru-HNP)
For all labeled datasets, we only use the training set for fine-tuning. For the Gazeta, Mlsum, and Xlsum datasets, pairs (title/text) and (title/summary) are combined and used as asymmetric data.
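As a sketch of how such asymmetric pairs could be assembled, assuming a summarization dataset with title, text, and summary columns (the Hugging Face dataset id and column names below are illustrative assumptions):

```python
from datasets import load_dataset

# Assumed dataset id and columns ("title", "text", "summary"), for illustration
gazeta = load_dataset("IlyaGusev/gazeta", split="train")

asymmetric_pairs = []
for row in gazeta:
    asymmetric_pairs.append((row["title"], row["text"]))     # (title, text) pair
    asymmetric_pairs.append((row["title"], row["summary"]))  # (title, summary) pair
```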
AllNLI is a Russian translation of the combination of SNLI, MNLI, and ANLI.
Experiments
We compare our model with the basic baai/bge-m3 on the encodechka benchmark. Additionally, we evaluate the model on the Russian subset of MTEB for Classification, Reranking, Multilabel Classification, STS, Retrieval, and PairClassification tasks. We use validation scripts from the official repositories for each task.
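As an illustration, a single Russian-language task can be scored with the mteb package roughly as follows; the task shown is only an example, not the full 30-dataset suite used for the numbers below:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER-bge-m3")

# Illustrative task selection restricted to Russian
evaluation = MTEB(tasks=["STS22"], task_langs=["ru"])
evaluation.run(model, output_folder="results/USER-bge-m3")
```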
Results on encodechka:

| Model | Mean S | Mean S+W | STS | PI | NLI | SA | TI | IA | IC | ICX | NE1 | NE2 |
|-------|--------|----------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| baai/bge-m3 | 0.787 | 0.696 | 0.86 | 0.75 | 0.51 | 0.82 | 0.97 | 0.79 | 0.81 | 0.78 | 0.24 | 0.42 |
| USER-bge-m3 | 0.799 | 0.709 | 0.87 | 0.76 | 0.58 | 0.82 | 0.97 | 0.79 | 0.81 | 0.78 | 0.28 | 0.43 |
Results on MTEB:

| Type | baai/bge-m3 | USER-bge-m3 |
|------|-------------|-------------|
| Average (30 datasets) | 0.689 | 0.706 |
| Classification Average (12 datasets) | 0.571 | 0.594 |
| Reranking Average (2 datasets) | 0.698 | 0.688 |
| MultilabelClassification (2 datasets) | 0.343 | 0.359 |
| STS Average (4 datasets) | 0.735 | 0.753 |
| Retrieval Average (6 datasets) | 0.945 | 0.934 |
| PairClassification Average (4 datasets) | 0.784 | 0.833 |
Limitations
We did not thoroughly evaluate the model's ability to produce sparse and multi-vector embeddings.
Citations
```bibtex
@misc{deepvk2024user,
    title={USER: Universal Sentence Encoder for Russian},
    author={Malashenko, Boris and Zemerov, Anton and Spirin, Egor},
    url={https://huggingface.co/datasets/deepvk/USER-base},
    publisher={Hugging Face},
    year={2024},
}
```
📄 License
This project is licensed under the Apache 2.0 License.