🚀 Sentence Similarity Model - rubert-tiny2
This is a small Russian BERT-based encoder that generates high-quality sentence embeddings suitable for sentence similarity tasks.
🚀 Quick Start
This is an updated version of cointegrated/rubert-tiny, a small Russian BERT-based encoder capable of generating high-quality sentence embeddings. For more details, refer to this post in Russian.
The differences from the previous version are as follows:
- Larger vocabulary: 83,828 tokens instead of 29,564.
- Longer supported sequences: 2048 tokens instead of 512.
- Sentence embeddings approximate LaBSE more closely than before.
- Meaningful segment embeddings (tuned on the NLI task).
- The model focuses solely on the Russian language.
The model can be used directly to generate sentence embeddings (e.g., for KNN classification of short texts) or fine-tuned for downstream tasks.
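As a rough illustration of the KNN use case mentioned above, the sketch below classifies a query vector by majority vote among its most similar labeled vectors. The function names (`cosine`, `knn_classify`), the labels, and the toy 3-dimensional vectors are hypothetical stand-ins; in practice the vectors would be sentence embeddings produced by the model.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_classify(query, train_vectors, train_labels, k=3):
    """Return the majority label among the k training vectors closest to query."""
    # Rank the labeled examples by similarity to the query, most similar first.
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    # Majority vote over the labels of the k nearest neighbours.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy vectors standing in for real sentence embeddings (assumption).
train_vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.0, 1.0, 0.2],
    [0.1, 0.9, 0.0],
]
train_labels = ["greeting", "greeting", "weather", "weather"]

predicted = knn_classify([0.8, 0.3, 0.0], train_vectors, train_labels, k=3)
print(predicted)
```

With real data, each short text would first be embedded, and the resulting vectors passed in place of the toy lists. A linear scan like this is only practical for small labeled sets.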
📦 Installation
To use this model, you need to install the necessary libraries. You can install them using the following command:
pip install transformers sentencepiece
💻 Usage Examples
Basic Usage
You can generate sentence embeddings using the following code:
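import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
model = AutoModel.from_pretrained("cointegrated/rubert-tiny2")
# model.cuda()  # uncomment if you have a GPU

def embed_bert_cls(text, model, tokenizer):
    # Tokenize and move the input tensors to the model's device
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # Use the hidden state of the first ([CLS]) token as the sentence embedding
    embeddings = model_output.last_hidden_state[:, 0, :]
    # L2-normalize so that a dot product equals cosine similarity
    embeddings = torch.nn.functional.normalize(embeddings)
    # Return the embedding of the first (only) input text as a NumPy array
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)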
Advanced Usage
You can also use the model with sentence_transformers (installed separately with pip install sentence-transformers):
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('cointegrated/rubert-tiny2')
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(embeddings)
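Since the model targets sentence similarity, a natural next step is to compare the resulting vectors. The following is a minimal sketch of a pairwise cosine similarity matrix in plain Python. The helper names (`cosine_similarity`, `similarity_matrix`) and the toy 2-dimensional vectors are illustrative assumptions, not part of the model or of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Pairwise cosine similarities: entry [i][j] compares vectors i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy vectors standing in for the rows of `embeddings` (assumption).
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matrix = similarity_matrix(toy_vectors)

for row in matrix:
    print([round(value, 3) for value in row])
```

Passing the `embeddings` array from the example above in place of `toy_vectors` should give one row per sentence. The norms are computed explicitly here, so the inputs do not need to be normalized beforehand. The sentence_transformers package also provides `util.cos_sim` for the same purpose.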
📄 License
This model is released under the MIT license.
📋 Model Information
| Property | Details |
|---|---|
| Pipeline Tag | Sentence Similarity |
| Tags | Russian, Fill-Mask, Pretraining, Embeddings, Masked-LM, Tiny, Feature-Extraction, Sentence-Similarity, Sentence-Transformers, Transformers |
| License | MIT |
| Widget Example | "Миниатюрная модель для [MASK] разных задач." |