🚀 USER-bge-m3
The Universal Sentence Encoder for Russian (USER) is a sentence-transformer model designed specifically for extracting embeddings in the Russian language. It maps sentences and paragraphs to a 1024-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.
This model is initialized from TatonkaHF/bge-m3_en_ru, a shrunk version of the baai/bge-m3 model, and is trained primarily for the Russian language. Its performance on other languages has not been evaluated.
✨ Features
- Russian-specific: Tailored to work effectively with the Russian language.
- Dense vector representation: Maps text to a 1024-dimensional dense vector space.
- Versatile applications: Suitable for clustering and semantic search tasks.
📦 Installation
Using this model is easy once you have sentence-transformers installed:
```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

input_texts = [
    "Когда был спущен на воду первый миноносец «Спокойный»?",
    "Есть ли нефть в Удмуртии?",
    "Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года.",
    "Нефтепоисковые работы в Удмуртии были начаты сразу после Второй мировой войны в 1945 году и продолжаются по сей день. Добыча нефти началась в 1967 году."
]

model = SentenceTransformer("deepvk/USER-bge-m3")
embeddings = model.encode(input_texts, normalize_embeddings=True)
```
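Since the embeddings are L2-normalized, relevance between the two queries and the two passages above can be scored with a plain dot product, mirroring the final step of the advanced example below:

```python
# Dot product of normalized embeddings equals cosine similarity
scores = embeddings[:2] @ embeddings[2:].T
print(scores)
```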
Advanced Usage
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

input_texts = [
    "Когда был спущен на воду первый миноносец «Спокойный»?",
    "Есть ли нефть в Удмуртии?",
    "Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года.",
    "Нефтепоисковые работы в Удмуртии были начаты сразу после Второй мировой войны в 1945 году и продолжаются по сей день. Добыча нефти началась в 1967 году."
]

tokenizer = AutoTokenizer.from_pretrained("deepvk/USER-bge-m3")
model = AutoModel.from_pretrained("deepvk/USER-bge-m3")
model.eval()

encoded_input = tokenizer(input_texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    model_output = model(**encoded_input)
    # Use the [CLS] token representation as the sentence embedding
    sentence_embeddings = model_output[0][:, 0]
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

# Similarity between the two queries and the two passages
scores = sentence_embeddings[:2] @ sentence_embeddings[2:].T
```
You can also use the native FlagEmbedding library for evaluation; its usage is described in the bge-m3 model card.
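For example, a minimal sketch with FlagEmbedding's BGEM3FlagModel, assuming the standard bge-m3 loading path also works for this checkpoint:

```python
from FlagEmbedding import BGEM3FlagModel

# Assumption: the checkpoint loads the same way as BAAI/bge-m3
model = BGEM3FlagModel("deepvk/USER-bge-m3", use_fp16=True)

input_texts = [
    "Когда был спущен на воду первый миноносец «Спокойный»?",
    "Есть ли нефть в Удмуртии?",
]

# Dense embeddings are returned under the "dense_vecs" key
embeddings = model.encode(input_texts)["dense_vecs"]
```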
📚 Documentation
Training Details
We follow the training algorithm of the USER-base model, with several modifications due to the use of a different backbone.

Initialization: TatonkaHF/bge-m3_en_ru – a shrunk version of baai/bge-m3 that supports only Russian and English tokens.
Fine-tuning: Supervised fine-tuning of two different models based on data symmetry, followed by merging via LM-Cocktail:
- Since we split the data, we can additionally apply the AnglE loss to the symmetric model, which improves performance on symmetric tasks.
- Finally, we add the original bge-m3 model to the two obtained models to prevent catastrophic forgetting, and tune the weights for the merger using LM-Cocktail to produce the final model, USER-bge-m3, as sketched below.
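The merging step amounts to a weighted combination of model parameters. A minimal illustrative sketch of that idea in plain PyTorch follows; this is not the LM-Cocktail implementation, and the checkpoint names and mixing weights are placeholders:

```python
from transformers import AutoModel

# Placeholder checkpoint names and mixing weights, for illustration only
checkpoints = ["path/to/symmetric-model", "path/to/asymmetric-model", "baai/bge-m3"]
weights = [0.4, 0.4, 0.2]

models = [AutoModel.from_pretrained(name) for name in checkpoints]
state_dicts = [m.state_dict() for m in models]

# Weighted average of floating-point parameters; non-float buffers are copied as-is
merged_state = {}
for key, ref in state_dicts[0].items():
    if ref.dtype.is_floating_point:
        merged_state[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    else:
        merged_state[key] = ref

merged_model = AutoModel.from_pretrained(checkpoints[0])
merged_model.load_state_dict(merged_state)
```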
Dataset
During model development, we additionally collect two datasets: deepvk/ru-HNP and deepvk/ru-WANLI.
- Total positive pairs: 2,240,961
- Total negative pairs: 792,644 (negative pairs from AllNLI, MIRACL, deepvk/ru-WANLI, deepvk/ru-HNP)
For all labeled datasets, we only use the training set for fine-tuning. For the Gazeta, Mlsum, and Xlsum datasets, pairs (title/text) and (title/summary) are combined and used as asymmetric data.
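As a sketch of how such asymmetric pairs could be assembled, assuming a summarization dataset with title, text, and summary columns (the Hugging Face dataset id and column names below are illustrative assumptions):

```python
from datasets import load_dataset

# Assumed dataset id and columns ("title", "text", "summary"), for illustration
gazeta = load_dataset("IlyaGusev/gazeta", split="train")

asymmetric_pairs = []
for row in gazeta:
    asymmetric_pairs.append((row["title"], row["text"]))     # (title, text) pair
    asymmetric_pairs.append((row["title"], row["summary"]))  # (title, summary) pair
```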
AllNLI is a Russian translation of the combination of SNLI, MNLI, and ANLI.
Experiments
We compare our model with the basic baai/bge-m3 on the encodechka benchmark. Additionally, we evaluate the model on the Russian subset of MTEB for Classification, Reranking, Multilabel Classification, STS, Retrieval, and PairClassification tasks. We use validation scripts from the official repositories for each task.
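As an illustration, a single Russian-language task can be scored with the mteb package roughly as follows; the task shown is only an example, not the full 30-dataset suite used for the numbers below:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER-bge-m3")

# Illustrative task selection restricted to Russian
evaluation = MTEB(tasks=["STS22"], task_langs=["ru"])
evaluation.run(model, output_folder="results/USER-bge-m3")
```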
Results on encodechka:

| Model | Mean S | Mean S+W | STS | PI | NLI | SA | TI | IA | IC | ICX | NE1 | NE2 |
|-------|--------|----------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| baai/bge-m3 | 0.787 | 0.696 | 0.86 | 0.75 | 0.51 | 0.82 | 0.97 | 0.79 | 0.81 | 0.78 | 0.24 | 0.42 |
| USER-bge-m3 | 0.799 | 0.709 | 0.87 | 0.76 | 0.58 | 0.82 | 0.97 | 0.79 | 0.81 | 0.78 | 0.28 | 0.43 |
Results on MTEB:

| Type | baai/bge-m3 | USER-bge-m3 |
|------|-------------|-------------|
| Average (30 datasets) | 0.689 | 0.706 |
| Classification Average (12 datasets) | 0.571 | 0.594 |
| Reranking Average (2 datasets) | 0.698 | 0.688 |
| MultilabelClassification (2 datasets) | 0.343 | 0.359 |
| STS Average (4 datasets) | 0.735 | 0.753 |
| Retrieval Average (6 datasets) | 0.945 | 0.934 |
| PairClassification Average (4 datasets) | 0.784 | 0.833 |
Limitations
We did not thoroughly evaluate the model's ability to produce sparse and multi-vector embeddings.
Citations
```bibtex
@misc{deepvk2024user,
    title={USER: Universal Sentence Encoder for Russian},
    author={Malashenko, Boris and Zemerov, Anton and Spirin, Egor},
    url={https://huggingface.co/datasets/deepvk/USER-base},
    publisher={Hugging Face},
    year={2024},
}
```
📄 License
This project is licensed under the Apache 2.0 License.