ruRoPEBert-classic-base-512: an open-source Russian encoder model with long-text support


Developed by Tochka-AI

A Russian encoder model based on the RoPEBert architecture and trained with the cloning method. It supports a 512-token context and surpasses the original ruBert-base model in quality.

Large Language Model

Transformers

Tags: Russian text encoding, RoPE attention mechanism, long text processing

Downloads 103

Release Time: 2/22/2024

Model Overview

This is a text encoder model optimized for Russian and built on the improved RoPEBert architecture. It is primarily used for text feature extraction and classification tasks.

Model Features

RoPE architecture improvements

BERT architecture with Rotary Position Embedding (RoPE) used for positional encoding.

Long context support

Natively supports a 512-token context and can be extended to longer texts via RoPE scaling.

Efficient attention mechanism

Supports the SDPA attention implementation for more efficient computation.

Built-in pooling

Provides mean and first_token_transform pooling for extracting text embedding vectors.

Model Capabilities

Text feature extraction

Semantic similarity calculation

Text classification

Long text processing

Use Cases

Semantic understanding

Text similarity calculation

Calculate the semantic similarity between two Russian texts

Computed by normalizing the embedding vectors and multiplying the resulting matrices

Text classification

Sentiment analysis

Classify sentiment tendencies in Russian texts

🚀 ruRoPEBert Classic Model for Russian language

This is an encoder model from Tochka AI based on the RoPEBert architecture. It uses the cloning method described in our article on Habr. The model is trained on the CulturaX dataset. With ai-forever/ruBert-base as the original model, it surpasses that model in quality according to the encodechka benchmark.

The model source code is available in the file modeling_rope_bert.py. The model is trained on contexts up to 512 tokens in length, but can be used on larger contexts. For better quality on long inputs, consider using the extended-context version of this model: Tochka-AI/ruRoPEBert-classic-base-2k.

🚀 Quick Start

Prerequisites

⚠️ Important Note

The recommended version of transformers is 4.37.2 or higher. To load the model correctly, you must enable downloading code from the model's repository by passing trust_remote_code=True. This downloads the modeling_rope_bert.py script and loads the weights into the correct architecture. Alternatively, you can download this script manually and use the classes from it directly to load the model.

💻 Usage Examples

🔍 Basic Usage (no efficient attention)

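from transformers import AutoTokenizer, AutoModel

model_name = 'Tochka-AI/ruRoPEBert-classic-base-512'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# trust_remote_code=True is required so that modeling_rope_bert.py is downloaded and used
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, attn_implementation='eager')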

⚡ With SDPA (efficient attention)

model = AutoModel.from_pretrained(model_name, trust_remote_code=True, attn_implementation='sdpa')

📈 Getting embeddings

The correct pooler (mean) is already built into the model architecture; it averages token embeddings according to the attention mask. You can also select the first_token_transform pooler, which applies a learnable linear transformation to the first token.

To change the built-in pooler implementation, pass the pooler_type parameter to AutoModel.from_pretrained.

import torch

# Tokenize a small batch; padding=True pads the shorter text to the longer one
test_batch = tokenizer(["Привет, чем занят?", "Здравствуйте, чем вы занимаетесь?"], return_tensors='pt', padding=True)
with torch.inference_mode():
    pooled_output = model(**test_batch).pooler_output
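The mean pooler described above can be illustrated with a minimal sketch. It is written in plain Python so that it runs without any dependencies, and it is not the model's own implementation, which operates on tensors inside modeling_rope_bert.py. The function name masked_mean_pool and the toy data are hypothetical.

```python
from typing import List


def masked_mean_pool(token_embeddings: List[List[List[float]]],
                     attention_mask: List[List[int]]) -> List[List[float]]:
    """Average token vectors per sequence, ignoring positions where mask == 0."""
    pooled = []
    for tokens, mask in zip(token_embeddings, attention_mask):
        hidden_size = len(tokens[0])
        n_real = sum(mask)  # number of non-padding tokens in this sequence
        sums = [0.0] * hidden_size
        for vec, keep in zip(tokens, mask):
            if keep:
                for i, value in enumerate(vec):
                    sums[i] += value
        # max(..., 1) guards against division by zero for an all-padding row
        pooled.append([s / max(n_real, 1) for s in sums])
    return pooled


# Toy batch: 2 sequences, 3 token slots, hidden size 2 (illustrative values only).
embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],  # last slot is padding and is ignored
]
mask = [[1, 1, 1], [1, 1, 0]]
pooled = masked_mean_pool(embeddings, mask)
print(pooled)
```

Without the mask, the padding vector in the second sequence would be averaged in and distort the embedding of the shorter text.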

In addition, you can calculate cosine similarities between texts in batch using normalization and matrix multiplication:

import torch.nn.functional as F
F.normalize(pooled_output, dim=1) @ F.normalize(pooled_output, dim=1).T
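To see why normalization followed by matrix multiplication yields cosine similarities, here is a dependency-free sketch of the same computation on toy vectors. The helper names l2_normalize and cosine_matrix are hypothetical and the vectors are made up; with real embeddings, the torch one-liner above does the same work on tensors.

```python
import math
from typing import List


def l2_normalize(rows: List[List[float]]) -> List[List[float]]:
    """Scale each row to unit length (rows of all zeros are left unchanged)."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        out.append([x / norm for x in row])
    return out


def cosine_matrix(rows: List[List[float]]) -> List[List[float]]:
    """Pairwise cosine similarity: product of the unit-row matrix with its transpose."""
    unit = l2_normalize(rows)
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]  # toy stand-ins for pooled embeddings
sims = cosine_matrix(vectors)
for row in sims:
    print([round(value, 3) for value in row])
```

Entry (i, j) of the result is the cosine similarity between text i and text j, so the diagonal is 1 and the matrix is symmetric.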

📊 Using as classifier

To load the model with a trainable classification head on top (change the num_labels parameter):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, attn_implementation='sdpa', num_labels=4)

📏 With RoPE scaling

Allowed types for RoPE scaling are: linear and dynamic. To extend the model's context window, you need to change the tokenizer's max length and add the rope_scaling parameter.

If you want to scale your model context by 2x:

tokenizer.model_max_length = 1024
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation='sdpa',
    rope_scaling={'type': 'dynamic', 'factor': 2.0},  # 2.0 for x2 scaling, 4.0 for x4, etc.
)
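As a rough intuition for what a scaling factor does, the sketch below computes rotary angles under the common linear (position interpolation) convention, where positions are divided by the factor before the angles are computed. This is an assumption about the general technique, not a description of this model's code in modeling_rope_bert.py. The function rope_angles, the head dimension of 8, and the base of 10000 are illustrative choices. Dynamic scaling is not shown; such variants typically adjust the rotary base with the sequence length instead.

```python
from typing import List


def rope_angles(position: float, head_dim: int = 8, base: float = 10000.0,
                linear_factor: float = 1.0) -> List[float]:
    """Rotation angles for one position, one angle per pair of dimensions.

    With linear scaling the position is divided by the factor, so longer
    inputs are squeezed into the position range seen during training.
    """
    scaled_position = position / linear_factor
    return [scaled_position / (base ** (2 * i / head_dim))
            for i in range(head_dim // 2)]


# Last position of a 512-token context without scaling.
trained_max = rope_angles(511)
# With factor 2.0, position 1022 produces the same angles as position 511.
extended = rope_angles(1022, linear_factor=2.0)
print(trained_max == extended)
```

Under this convention, a factor of 2.0 keeps every position of a 1024-token input within the angle range the model saw while training on 512 tokens.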

💡 Usage Tip

To use resources efficiently, remember to specify the dtype and device you need.

📊 Metrics

Evaluation of this model on the encodechka benchmark:

Model name	STS	PI	NLI	SA	TI	IA	IC	ICX	NE1	NE2	Avg S (no NE)	Avg S+W (with NE)
ruRoPEBert-classic-base-512	0.695	0.605	0.396	0.794	0.975	0.797	0.769	0.386	0.410	0.609	0.677	0.630
ai-forever/ruBert-base	0.670	0.533	0.391	0.773	0.975	0.783	0.765	0.384	-	-	0.659	-
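The "Avg S (no NE)" column can be reproduced from the table, assuming it is the unweighted mean of the eight sentence-level tasks (STS through ICX). The reported numbers are consistent with that assumption, but the benchmark's own definition should be treated as authoritative. The "Avg S+W" column is not recomputed here because its weighting is not stated in this document.

```python
# Sentence-level scores copied from the table above, in column order:
# STS, PI, NLI, SA, TI, IA, IC, ICX
scores = {
    "ruRoPEBert-classic-base-512": [0.695, 0.605, 0.396, 0.794, 0.975, 0.797, 0.769, 0.386],
    "ai-forever/ruBert-base": [0.670, 0.533, 0.391, 0.773, 0.975, 0.783, 0.765, 0.384],
}

# Assumption: Avg S is the plain arithmetic mean over these eight tasks.
avg_s = {name: sum(values) / len(values) for name, values in scores.items()}

for name, value in avg_s.items():
    print(f"{name}: {value:.3f}")
```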

👨‍💻 Authors

Sergei Bratchikov (Tochka AI Team, HF, GitHub)
Maxim Afanasiev (Tochka AI Team, HF, GitHub)
