🚀 SciRus-tiny
SciRus-tiny is a model that produces embeddings for scientific texts in Russian and English. It was trained on eLibrary data with contrastive learning techniques described in a Habr post, and it achieves high scores on the ruSciBench benchmark.
🚀 Quick Start
💻 Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F
import torch
tokenizer = AutoTokenizer.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
model = AutoModel.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")

def mean_pooling(model_output, attention_mask):
    # The first element of model_output holds the per-token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Average over real tokens only; the clamp guards against division by zero
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

def get_sentence_embedding(title, abstract, model, tokenizer, max_length=None):
    # Title and abstract are joined with the '</s>' separator token
    sentence = '</s>'.join([title, abstract])
    encoded_input = tokenizer(
        [sentence], padding=True, truncation=True, return_tensors='pt', max_length=max_length).to(model.device)
    with torch.no_grad():
        model_output = model(**encoded_input)
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    # L2-normalize so that dot products between embeddings equal cosine similarities
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings.cpu().detach().numpy()[0]

print(get_sentence_embedding('some title', 'some abstract', model, tokenizer).shape)
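To make the pooling step in `mean_pooling` easier to follow, here is a minimal sketch of the same idea in plain Python, without torch or the model. The function name `masked_mean` and the toy vectors are hypothetical and only illustrate the arithmetic: padded positions (mask value 0) are left out of the average.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, skipping padding."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # Guard against an all-padding input, mirroring torch.clamp(..., min=1e-9)
    count = max(count, 1e-9)
    return [total / count for total in totals]


# Hypothetical toy input: three token vectors, the last one is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[100.0, 100.0]` does not affect the result, which is why the attention mask is passed to the pooling function alongside the model output.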
Advanced Usage
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('mlsa-iai-msu-lab/sci-rus-tiny')
embeddings = model.encode(['some title' + '</s>' + 'some abstract'])
print(embeddings[0].shape)
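Since the model is tagged for sentence similarity, a typical next step is to compare embeddings with cosine similarity. The sketch below is an illustration only: it uses plain Python and hypothetical three-dimensional vectors in place of real model output, and the names `cosine_similarity`, `rank_by_similarity`, and `paper_a`/`paper_b`/`paper_c` are assumptions rather than part of the model's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy vectors standing in for embeddings of title + abstract pairs
query = [1.0, 0.0, 0.0]
candidates = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.5, 0.5, 0.0],
}
ranking = rank_by_similarity(query, candidates)
for name, score in ranking:
    print(name, round(score, 3))
```

With real embeddings, the vectors returned by `model.encode` would take the place of the toy lists. If the embeddings are already L2-normalized, the division by the norms is redundant and a plain dot product gives the same ranking.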
📚 Documentation
Authors
The benchmark was developed by the MLSA Lab of the Institute for AI, MSU.
Acknowledgement
This research is part of the project #23-Ш05-21 SES MSU "Development of mathematical methods of machine learning for processing large-volume textual scientific information". We would like to thank eLibrary for providing the datasets.
Contacts
Nikolai Gerasimenko (nikgerasimenko@gmail.com), Alexey Vatolin (vatolinalex@gmail.com)
Citation
@article{Gerasimenko2024,
  author = {Gerasimenko, N. and Vatolin, A. and Ianina, A. and Vorontsov, K.},
  title = {SciRus: Tiny and Powerful Multilingual Encoder for Scientific Texts},
  journal = {Doklady Mathematics},
  year = {2024},
  volume = {110},
  number = {1},
  pages = {S193--S202},
  month = {dec},
  issn = {1531-8362},
  doi = {10.1134/S1064562424602178},
  url = {https://doi.org/10.1134/S1064562424602178}
}
📄 License
This project is licensed under the MIT license.
📦 Model Information
| Property | Details |
|----------|---------|
| Pipeline Tag | sentence-similarity |
| Tags | russian, fill-mask, pretraining, embeddings, masked-lm, tiny, feature-extraction, sentence-similarity, sentence-transformers, transformers |
| Widget Text | Метод опорных векторов (Support vector machine) |