🚀 ruRoPEBert Sentence Model for Russian language
This is an encoder model from Tochka AI based on the RoPEBert architecture, built with the cloning method described in our article on Habr. The model provides high-quality sentence encoding for the Russian language.
The model was trained on the CulturaX dataset. hivaze/ru-e5-base (intfloat/multilingual-e5-base with only the English and Russian embeddings kept) was used as the original model. At the time of creation, this model surpasses the original and all other models on the S+W score of the encodechka benchmark.
The model source code is available in [modeling_rope_bert.py](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-2k/blob/main/modeling_rope_bert.py). The model was trained on contexts up to 2048 tokens in length, but can be used on longer contexts.
🚀 Quick Start
✨ Features
- Based on the RoPEBert architecture, providing high-quality encoding for Russian sentences.
- Trained on the CulturaX dataset, with better performance than many other models according to the encodechka benchmark.
- Can handle contexts up to 2048 tokens and can be extended with RoPE scaling.
- Supports different attention implementations and pooler types.
📦 Installation
The recommended version of `transformers` is 4.37.2 or higher. To load the model correctly, you must enable downloading code from the model's repository by passing `trust_remote_code=True`, which downloads the modeling_rope_bert.py script and loads the weights into the correct architecture. Otherwise, you can download this script manually and use its classes directly to load the model.
💻 Usage Examples
Basic Usage (no efficient attention)
```python
from transformers import AutoTokenizer, AutoModel

model_name = 'Tochka-AI/ruRoPEBert-e5-base-2k'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, attn_implementation='eager')
```
With SDPA (efficient attention)
```python
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, attn_implementation='sdpa')
```
Getting embeddings
The correct pooler (`mean`) is already built into the model architecture and averages embeddings based on the attention mask. You can also select the pooler type (`first_token_transform`), which performs a learnable linear transformation on the first token. To change the built-in pooler implementation, use the `pooler_type` parameter in the `AutoModel.from_pretrained` function.
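For example, a minimal sketch of selecting the alternative pooler via `AutoModel.from_pretrained`, using the `first_token_transform` value described above:

```python
# Load the model with the first-token pooler instead of the default mean pooler
model_ft = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation='sdpa',
    pooler_type='first_token_transform'  # alternative pooler described above
)
```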
```python
import torch

test_batch = tokenizer.batch_encode_plus(
    ["Привет, чем занят?", "Здравствуйте, чем вы занимаетесь?"],
    return_tensors='pt', padding=True
)
with torch.inference_mode():
    pooled_output = model(**test_batch).pooler_output
```
In addition, you can calculate cosine similarities between texts in batch using normalization and matrix multiplication:
```python
import torch.nn.functional as F

normed = F.normalize(pooled_output, dim=1)
similarities = normed @ normed.T
```
Using as classifier
To load the model with a trainable classification head on top, change the `num_labels` parameter:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, trust_remote_code=True, attn_implementation='sdpa', num_labels=4
)
```
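A minimal usage sketch, assuming the standard Hugging Face sequence-classification interface; the example texts and label ids below are hypothetical:

```python
import torch

# Hypothetical batch and labels, just to show the forward pass
batch = tokenizer(["Отличный сервис!", "Очень долго ждал ответа."], return_tensors='pt', padding=True)
labels = torch.tensor([1, 0])  # class ids from your own labeling scheme

outputs = model(**batch, labels=labels)
print(outputs.loss, outputs.logits.shape)
```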
With RoPE scaling
Allowed types for RoPE scaling are `linear` and `dynamic`. To extend the model's context window, you need to change the tokenizer's max length and add the `rope_scaling` parameter.
If you want to scale your model context by 2x:
```python
tokenizer.model_max_length = 4096
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation='sdpa',
    rope_scaling={'type': 'dynamic', 'factor': 2.0}
)
```
⚠️ Important Note
Don't forget to specify the dtype and device you need in order to use resources efficiently.
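For example, a minimal sketch of loading the model in half precision on a GPU; the dtype and device below are illustrative choices, not requirements:

```python
import torch

model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation='sdpa',
    torch_dtype=torch.float16  # illustrative: pick the dtype that fits your hardware
).to('cuda')  # illustrative: move to the device you actually have
```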
📚 Documentation
The model is described in detail in our article on Habr. The source code is available in [modeling_rope_bert.py](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-2k/blob/main/modeling_rope_bert.py).
🔧 Technical Details
The model is based on the RoPEBert architecture and uses the cloning method described in the article. It is trained on the CulturaX dataset. Performance is evaluated on the encodechka benchmark, where it surpasses the original model and other evaluated models on the S+W score (see the Metrics table below).
📄 License
No license information is provided in the original document.
Metrics
Evaluation of this model on the encodechka benchmark:
| Model name | STS | PI | NLI | SA | TI | IA | IC | ICX | NE1 | NE2 | Avg S (no NE) | Avg S+W (with NE) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ruRoPEBert-e5-base-512 | 0.793 | 0.704 | 0.457 | 0.803 | 0.970 | 0.788 | 0.802 | 0.749 | 0.328 | 0.396 | 0.758 | 0.679 |
| ruRoPEBert-e5-base-2k | 0.787 | 0.708 | 0.460 | 0.804 | 0.970 | 0.792 | 0.803 | 0.749 | 0.402 | 0.423 | 0.759 | 0.689 |
| intfloat/multilingual-e5-base | 0.834 | 0.704 | 0.458 | 0.795 | 0.964 | 0.782 | 0.803 | 0.740 | 0.234 | 0.373 | 0.76 | 0.668 |
Authors
- Sergei Bratchikov (Tochka AI Team, HF, GitHub)
- Maxim Afanasiev (Tochka AI Team, HF, GitHub)