---
language:
- ru
pipeline_tag: sentence-similarity
tags:
- russian
- pretraining
- embeddings
- feature-extraction
- sentence-similarity
- sentence-transformers
- transformers
license: mit
base_model: cointegrated/LaBSE-en-ru
---
## Base BERT for Semantic text similarity (STS) on GPU

A high-quality BERT model for computing sentence embeddings in Russian. The model is based on cointegrated/LaBSE-en-ru and has the same context length (512 tokens), embedding dimension (768), and inference speed.
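A quick way to confirm the context length and embedding dimension is to inspect the model config; a minimal sketch, relying on the standard BERT config fields `max_position_embeddings` and `hidden_size` in transformers:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("sergeyzh/LaBSE-ru-sts")
print(config.max_position_embeddings)  # context length, expected to be 512
print(config.hidden_size)              # embedding dimension, expected to be 768
```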
## Usage with the transformers library
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")
# model.cuda()  # uncomment to run on GPU

def embed_bert_cls(text, model, tokenizer):
    # tokenize and move the inputs to the model's device
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # take the [CLS] token embedding and L2-normalize it
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (768,)
```
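Because `embed_bert_cls` returns an L2-normalized [CLS] embedding, the dot product of two embeddings equals their cosine similarity. A minimal sketch of scoring a sentence pair, reusing `embed_bert_cls`, `model`, and `tokenizer` from the snippet above (the sentences are only illustrative):

```python
import numpy as np

emb1 = embed_bert_cls('привет мир', model, tokenizer)
emb2 = embed_bert_cls('здравствуй вселенная', model, tokenizer)

# dot product of unit vectors == cosine similarity
print(float(np.dot(emb1, emb2)))
```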
## Usage with sentence_transformers
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sergeyzh/LaBSE-ru-sts')
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)

# pairwise dot-product similarity matrix
print(util.dot_score(embeddings, embeddings))
```
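The same embeddings can also drive a simple semantic search over a small corpus; a minimal sketch using `util.semantic_search` from sentence-transformers (the corpus and query below are only illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sergeyzh/LaBSE-ru-sts')

corpus = ["привет мир", "кошка сидит на окне", "курс доллара вырос"]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("здравствуй вселенная", convert_to_tensor=True)

# retrieve the top-2 most similar corpus sentences for the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```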
## Metrics

Model scores on the encodechka benchmark:

Tasks:
- Semantic text similarity (STS);
- Paraphrase identification (PI);
- Natural language inference (NLI);
- Sentiment analysis (SA);
- Toxicity identification (TI).
## Speed and size

Model scores on the encodechka benchmark:
Model scores on the ruMTEB benchmark:

| Task | Metric | sbert_large_mt_nlu_ru | sbert_large_nlu_ru | LaBSE-ru-sts | LaBSE-ru-turbo | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
|------|--------|-----------------------|--------------------|--------------|----------------|-----------------------|----------------------|-----------------------|
| CEDRClassification | Accuracy | 0.368 | 0.358 | 0.418 | 0.451 | 0.401 | 0.423 | 0.448 |
| GeoreviewClassification | Accuracy | 0.397 | 0.400 | 0.406 | 0.438 | 0.447 | 0.461 | 0.497 |
| GeoreviewClusteringP2P | V-measure | 0.584 | 0.590 | 0.626 | 0.644 | 0.586 | 0.545 | 0.605 |
| HeadlineClassification | Accuracy | 0.772 | 0.793 | 0.633 | 0.688 | 0.732 | 0.757 | 0.758 |
| InappropriatenessClassification | Accuracy | 0.646 | 0.625 | 0.599 | 0.615 | 0.592 | 0.588 | 0.616 |
| KinopoiskClassification | Accuracy | 0.503 | 0.495 | 0.496 | 0.521 | 0.500 | 0.509 | 0.566 |
| RiaNewsRetrieval | NDCG@10 | 0.214 | 0.111 | 0.651 | 0.694 | 0.700 | 0.702 | 0.807 |
| RuBQReranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | 0.756 |
| RuBQRetrieval | NDCG@10 | 0.298 | 0.124 | 0.622 | 0.657 | 0.685 | 0.696 | 0.741 |
| RuReviewsClassification | Accuracy | 0.589 | 0.583 | 0.599 | 0.632 | 0.612 | 0.630 | 0.653 |
| RuSTSBenchmarkSTS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | 0.831 |
| RuSciBenchGRNTIClassification | Accuracy | 0.542 | 0.539 | 0.529 | 0.569 | 0.550 | 0.563 | 0.582 |
| RuSciBenchGRNTIClusteringP2P | V-measure | 0.522 | 0.504 | 0.486 | 0.517 | 0.511 | 0.516 | 0.520 |
| RuSciBenchOECDClassification | Accuracy | 0.438 | 0.430 | 0.406 | 0.440 | 0.427 | 0.423 | 0.445 |
| RuSciBenchOECDClusteringP2P | V-measure | 0.473 | 0.464 | 0.426 | 0.452 | 0.443 | 0.448 | 0.450 |
| SensitiveTopicsClassification | Accuracy | 0.285 | 0.280 | 0.262 | 0.272 | 0.228 | 0.234 | 0.257 |
| TERRaClassification | Average Precision | 0.520 | 0.502 | 0.587 | 0.585 | 0.551 | 0.550 | 0.584 |
Averages by task type:

| Task type | Metric | sbert_large_mt_nlu_ru | sbert_large_nlu_ru | LaBSE-ru-sts | LaBSE-ru-turbo | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
|-----------|--------|-----------------------|--------------------|--------------|----------------|-----------------------|----------------------|-----------------------|
| Classification | Accuracy | 0.554 | 0.552 | 0.524 | 0.558 | 0.551 | 0.561 | 0.588 |
| Clustering | V-measure | 0.526 | 0.519 | 0.513 | 0.538 | 0.513 | 0.503 | 0.525 |
| MultiLabelClassification | Accuracy | 0.326 | 0.319 | 0.340 | 0.361 | 0.314 | 0.329 | 0.353 |
| PairClassification | Average Precision | 0.520 | 0.502 | 0.587 | 0.585 | 0.551 | 0.550 | 0.584 |
| Reranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | 0.756 |
| Retrieval | NDCG@10 | 0.256 | 0.118 | 0.637 | 0.675 | 0.697 | 0.699 | 0.774 |
| STS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | 0.831 |
| Average | Average | 0.494 | 0.438 | 0.582 | 0.604 | 0.588 | 0.594 | 0.630 |