SciRus-tiny: an open-source model for embedding Russian and English scientific texts

Sci Rus Tiny

Developed by mlsa-iai-msu-lab

SciRus-tiny is a compact model for obtaining Russian and English scientific text embeddings, trained on eLibrary data using contrastive learning techniques.

Text Embedding

Transformers

Supports Multiple Languages

Open Source License: MIT

#Scientific Text Embedding #Russian-English Bilingual Support #Contrastive Learning Training

Downloads 369

Release Time: 12/17/2023

Model Overview

This model is designed for Russian and English scientific texts. It generates embeddings suitable for tasks such as sentence similarity calculation.

Model Features

Multilingual Support

Supports processing of Russian and English scientific texts

Contrastive Learning Technique

Trained using contrastive learning, improving embedding quality

Scientific Text Optimization

Trained specifically on scientific texts (eLibrary data) for use in scientific domains

Compact Model

Small model size, suitable for resource-constrained environments

Model Capabilities

Generate text embeddings

Calculate sentence similarity

Process scientific texts

Support Russian and English

Use Cases

Academic Research

Scientific Literature Retrieval

Finding relevant scientific literature through embedding similarity

Achieves high metric values on the ruSciBench benchmark

Paper Recommendation

Recommending related research papers based on content similarity
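
Both use cases above come down to ranking papers by embedding similarity. The following is a minimal sketch of that ranking step, not the authors' retrieval pipeline. The function names `cosine_similarity` and `rank_papers` are hypothetical, and the 3-dimensional toy vectors stand in for the 312-dimensional embeddings the model returns.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_papers(query_embedding, paper_embeddings, top_k=3):
    """Return up to top_k (paper_id, score) pairs, most similar first."""
    scored = [
        (paper_id, cosine_similarity(query_embedding, embedding))
        for paper_id, embedding in paper_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption): real embeddings would come from the model.
toy_papers = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.6, 0.8, 0.0],
    "paper_c": [0.0, 0.0, 1.0],
}
ranking = rank_papers([1.0, 0.0, 0.0], toy_papers, top_k=2)
print(ranking)
```

For retrieval, the query embedding would come from a search query or seed paper. For recommendation, it would be the embedding of the paper a reader is currently viewing.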

Text Analysis

Scientific Text Classification

Classifying scientific texts using embeddings
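
The model card does not say which classifier to put on top of the embeddings. As one possible illustration, the sketch below uses a nearest-centroid classifier, chosen here only for its simplicity. The names `fit_centroids` and `predict` are hypothetical, and the 2-dimensional toy vectors and labels are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fit_centroids(embeddings, labels):
    """Average the embeddings of each class into one centroid per label."""
    grouped = {}
    for embedding, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(embedding)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(embedding, centroids):
    """Return the label whose centroid is most similar to the embedding."""
    return max(
        centroids,
        key=lambda label: cosine_similarity(embedding, centroids[label]),
    )


# Toy training data (assumption): real vectors would be model embeddings.
train_embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["physics", "physics", "biology", "biology"]

centroids = fit_centroids(train_embeddings, train_labels)
predicted = predict([0.7, 0.3], centroids)
print(predicted)
```

Any standard classifier that accepts fixed-length feature vectors could replace the centroid step.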

🚀 SciRus-tiny

SciRus-tiny generates embeddings for scientific texts in Russian and English. It was trained on eLibrary data with contrastive learning techniques described in a Habr post, and it achieves high metric values on the ruSciBench benchmark.

🚀 Quick Start

💻 Usage Examples

Basic Usage

from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F
import torch


tokenizer = AutoTokenizer.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
model = AutoModel.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
# model.cuda()  # if you want to use a GPU

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def get_sentence_embedding(title, abstract, model, tokenizer, max_length=None):
    # Join title and abstract with the '</s>' separator, then tokenize
    sentence = '</s>'.join([title, abstract])
    encoded_input = tokenizer(
        [sentence], padding=True, truncation=True, return_tensors='pt', max_length=max_length).to(model.device)
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Perform pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings.cpu().detach().numpy()[0]

print(get_sentence_embedding('some title', 'some abstract', model, tokenizer).shape)
# (312,)
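
To make the pooling step concrete, the sketch below reproduces the masked average from `mean_pooling` in plain Python on toy values. It illustrates the arithmetic only and is not a replacement for the tensor code. The name `masked_mean` is hypothetical, and it skips masked tokens instead of multiplying by the mask, which gives the same result for a 0/1 mask.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors where the mask is 1; padded positions are ignored."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                totals[i] += value
    # Lower bound on the divisor mirrors torch.clamp(..., min=1e-9)
    # and avoids division by zero when every position is masked.
    count = max(sum(attention_mask), 1e-9)
    return [total / count for total in totals]


# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector does not affect the result. This is why the attention mask is passed to the pooling function along with the model output.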

Advanced Usage

from sentence_transformers import SentenceTransformer


model = SentenceTransformer('mlsa-iai-msu-lab/sci-rus-tiny')
embeddings = model.encode(['some title' + '</s>' + 'some abstract'])
print(embeddings[0].shape)
# (312,)
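
When embedding many papers at once, the input strings can be prepared in the same title, separator, abstract format before encoding. The helper below is a small sketch of that step. The name `build_inputs` and the sample papers are hypothetical, and the commented-out last line assumes the `SentenceTransformer` model loaded as shown above.

```python
SEPARATOR = "</s>"  # same separator as in the usage examples


def build_inputs(papers):
    """Turn (title, abstract) pairs into 'title</s>abstract' strings."""
    return [SEPARATOR.join([title, abstract]) for title, abstract in papers]


# Invented sample records, for illustration only.
papers = [
    ("Support vector machines", "We study margin-based classifiers."),
    ("Topic models", "We review probabilistic topic models."),
]

texts = build_inputs(papers)
print(texts[0])

# embeddings = model.encode(texts)  # one embedding per input string
```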

📚 Documentation

Authors

The benchmark was developed by the MLSA Lab of the Institute for AI, MSU.

Acknowledgement

This research is part of the project #23-Ш05-21 SES MSU "Development of mathematical methods of machine learning for processing large-volume textual scientific information". We would like to thank eLibrary for providing the datasets.

Contacts

Nikolai Gerasimenko (nikgerasimenko@gmail.com), Alexey Vatolin (vatolinalex@gmail.com)

Citation

@article{Gerasimenko2024,
  author  = {Gerasimenko, N. and Vatolin, A. and Ianina, A. and Vorontsov, K.},
  title   = {SciRus: Tiny and Powerful Multilingual Encoder for Scientific Texts},
  journal = {Doklady Mathematics},
  year    = {2024},
  volume  = {110},
  number  = {1},
  pages   = {S193--S202},
  month   = {dec},
  issn    = {1531-8362},
  doi     = {10.1134/S1064562424602178},
  url     = {https://doi.org/10.1134/S1064562424602178}
}

📄 License

This project is licensed under the MIT license.

📦 Model Information

Property	Details
Pipeline Tag	sentence-similarity
Tags	russian, fill-mask, pretraining, embeddings, masked-lm, tiny, feature-extraction, sentence-similarity, sentence-transformers, transformers
Widget Text	Метод опорных векторов
