rubert-tiny2 Open-Source Russian Encoder - Generate High-Quality Sentence Embeddings for Free

Rubert Tiny2

Developed by cointegrated

A compact BERT-based Russian encoder capable of generating high-quality sentence embeddings

Text Embedding

Transformers

Open Source License: MIT

Tags: Russian sentence embeddings, Short text classification, Efficient BERT

Downloads 585.48k

Release Time: 3/2/2022

Model Overview

This is an upgraded version of rubert-tiny, specialized for Russian language processing, suitable for generating sentence embeddings or fine-tuning for downstream tasks.

Model Features

Expanded vocabulary

Vocabulary increased from 29,564 to 83,828 tokens, enhancing model expressiveness

Long sequence support

Maximum supported sequence length extended from 512 to 2048 tokens

High-quality sentence embeddings

Sentence embeddings approximate those of LaBSE more closely than in the previous version

Optimized segment embeddings

Segment embeddings are meaningful, having been tuned on the NLI task

Specialized for Russian

The model is specifically optimized for Russian language processing

Model Capabilities

Generate sentence embeddings

Short text classification

Sentence similarity calculation

Masked language modeling

Use Cases

Text processing

Short text classification

Classify short texts using methods like KNN

Semantic search

Perform semantic similarity searches based on sentence embeddings
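A semantic search over precomputed sentence embeddings can be sketched roughly as follows. This is only an illustration: the function name `top_k_similar` and the toy 3-dimensional vectors are assumptions made here, and in practice the rows would be 312-dimensional embeddings produced by the model.

```python
import numpy as np


def top_k_similar(query_vec, corpus_vecs, k=3):
    """Return (index, cosine similarity) pairs for the k corpus rows closest to the query."""
    query = np.asarray(query_vec, dtype=np.float64)
    corpus = np.asarray(corpus_vecs, dtype=np.float64)
    # L2-normalize both sides so that the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query
    # Sort by descending similarity and keep the first k indices.
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Toy vectors (hypothetical data) stand in for real sentence embeddings.
corpus = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
hits = top_k_similar([1.0, 0.1, 0.0], corpus, k=2)
print(hits)
```

For large collections, an approximate nearest-neighbor index would likely replace the brute-force dot product used here.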

🚀 Sentence Similarity Model - rubert-tiny2

This is a small Russian BERT-based encoder that generates high-quality sentence embeddings, suitable for sentence similarity tasks.

🚀 Quick Start

This is an updated version of cointegrated/rubert-tiny, a small Russian BERT-based encoder capable of generating high-quality sentence embeddings. For more details, refer to this post in Russian.

The differences from the previous version are as follows:

Larger vocabulary: 83,828 tokens instead of 29,564.
Longer supported sequences: 2048 tokens instead of 512.
Sentence embeddings approximate LaBSE more closely than before.
Meaningful segment embeddings (tuned on the NLI task).
The model focuses solely on the Russian language.

The model can be used directly to generate sentence embeddings (e.g., for KNN classification of short texts) or fine-tuned for downstream tasks.
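As a rough illustration of KNN classification on top of sentence embeddings, here is a minimal sketch that assumes cosine similarity and a majority vote. The function name `knn_predict`, the labels, and the toy 2-dimensional vectors are hypothetical; real inputs would be embeddings produced by the model.

```python
from collections import Counter

import numpy as np


def knn_predict(query_vec, train_vecs, train_labels, k=3):
    """Predict a label by majority vote among the k nearest training embeddings."""
    query = np.asarray(query_vec, dtype=np.float64)
    train = np.asarray(train_vecs, dtype=np.float64)
    # Cosine similarity: normalize both sides, then take dot products.
    query = query / np.linalg.norm(query)
    train = train / np.linalg.norm(train, axis=1, keepdims=True)
    scores = train @ query
    # Indices of the k most similar training examples.
    nearest = np.argsort(-scores)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy vectors and labels (hypothetical data) stand in for embedded short texts.
train_vecs = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
train_labels = ["greeting", "greeting", "weather", "weather"]
print(knn_predict([0.95, 0.15], train_vecs, train_labels, k=3))
```

An odd `k` avoids ties when there are two classes; with more classes a tie-breaking rule would be needed.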

📦 Installation

Install the required libraries with the following command. The basic example below also imports PyTorch (torch), and the advanced example needs the sentence-transformers package.

pip install transformers sentencepiece

💻 Usage Examples

Basic Usage

You can generate sentence embeddings using the following code:

# pip install transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
model = AutoModel.from_pretrained("cointegrated/rubert-tiny2")
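# model.cuda()  # uncomment this line if you have a GPU

def embed_bert_cls(text, model, tokenizer):
    # Tokenize the text and move the input tensors to the model's device.
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # Use the hidden state of the first ([CLS]) token as the sentence embedding.
    embeddings = model_output.last_hidden_state[:, 0, :]
    # L2-normalize so that dot products equal cosine similarities.
    embeddings = torch.nn.functional.normalize(embeddings)
    # Return the embedding of the first input text as a NumPy array.
    return embeddings[0].cpu().numpy()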

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)

Advanced Usage

You can also use the model with sentence_transformers:

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('cointegrated/rubert-tiny2')
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(embeddings)
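To compare the encoded sentences with one another, pairwise cosine similarities can be computed from the embedding array. The sketch below is an assumption-laden illustration: `cosine_similarity_matrix` is a hypothetical helper, and the toy vectors stand in for the array returned by `model.encode(sentences)`.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Return the matrix of pairwise cosine similarities between embedding rows."""
    vecs = np.asarray(embeddings, dtype=np.float64)
    # Normalize each row to unit length; the Gram matrix then holds the cosines.
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs @ vecs.T


# Toy vectors (hypothetical data) stand in for real sentence embeddings.
toy_embeddings = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sims = cosine_similarity_matrix(toy_embeddings)
print(np.round(sims, 3))
```

Entry `sims[i, j]` is the similarity between sentence `i` and sentence `j`, so the diagonal is always 1.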

📄 License

This model is released under the MIT license.

📋 Model Information

Property	Details
Pipeline Tag	Sentence Similarity
Tags	Russian, Fill-Mask, Pretraining, Embeddings, Masked-LM, Tiny, Feature-Extraction, Sentence-Similarity, Sentence-Transformers, Transformers
License	MIT
Widget Example	"Миниатюрная модель для [MASK] разных задач."
