uaritm/multilingual_en_ru_uk
This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.
A newer version of this model that adds Polish is available here: uaritm/multilingual_en_uk_pl_ru
Quick Start
Features
- Multilingual Support: Covers Ukrainian (uk), English (en), and Russian (ru); Polish (pl) is added in the newer uaritm/multilingual_en_uk_pl_ru version.
- Sentence Similarity: Ideal for sentence similarity tasks, clustering, and semantic search.
Installation
To use this model, first install sentence-transformers:
pip install -U sentence-transformers
Usage Examples
Basic Usage
The model powers a multilingual service that analyzes patient complaints and determines which medical specialty is needed in each case: Virtual General Practice. You can test the model's quality and speed there.
Once sentence-transformers is installed, you can use the model like this:
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('uaritm/multilingual_en_ru_uk')
embeddings = model.encode(sentences)  # one 768-dimensional vector per sentence
print(embeddings)
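For semantic search or clustering, the resulting embeddings are typically compared with cosine similarity (sentence_transformers.util.cos_sim does this for you). The pure-NumPy sketch below shows the computation on toy vectors standing in for real model output:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two batches of embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy 4-dimensional "embeddings" in place of the model's 768-dim output.
query = np.array([[1.0, 0.0, 1.0, 0.0]])
corpus = np.array([[1.0, 0.0, 1.0, 0.0],   # identical direction -> similarity 1.0
                   [0.0, 1.0, 0.0, 1.0]])  # orthogonal -> similarity 0.0
print(cos_sim(query, corpus))
```

In practice you would rank corpus sentences by their similarity to the query and return the top hits.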
Advanced Usage
Without sentence-transformers, you can use the model like this: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, taking the attention mask
# into account so that padded tokens do not contribute.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['This is an example sentence', 'Each sentence is converted']

tokenizer = AutoTokenizer.from_pretrained('uaritm/multilingual_en_ru_uk')
model = AutoModel.from_pretrained('uaritm/multilingual_en_ru_uk')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling (here: mean pooling)
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
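The masking in mean_pooling matters: padded positions must be excluded from the average, and the clamp guards against division by zero for an all-zero mask. A NumPy sketch of the same computation on toy data (two real tokens, one padding token):

```python
import numpy as np

def mean_pooling_np(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padded positions (mask == 0)."""
    mask = attention_mask[..., None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # padding zeroed out
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

tokens = np.array([[[1.0, 1.0], [3.0, 3.0], [99.0, 99.0]]])  # last token is padding
mask = np.array([[1, 1, 0]])
print(mean_pooling_np(tokens, mask))  # -> [[2. 2.]]: the padded token is ignored
```

Without the mask, the padding row would skew the average; with it, only the two real tokens are averaged.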
Documentation
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
Technical Details
Training
The model was trained with the following parameters:
DataLoader:
torch.utils.data.dataloader.DataLoader
of length 17482 with parameters:
{'batch_size': 128, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
Loss:
sentence_transformers.losses.MSELoss.MSELoss
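The card does not describe the training setup in detail, but MSELoss in sentence-transformers computes the mean squared error between the student model's sentence embeddings and target embeddings (typically from a teacher model, as in multilingual knowledge distillation). A minimal NumPy sketch of that objective on toy vectors:

```python
import numpy as np

def mse_loss(student_embeddings, teacher_embeddings):
    """Mean squared error between student and teacher sentence embeddings."""
    diff = student_embeddings - teacher_embeddings
    return np.mean(diff ** 2)

# Toy 2-dim embeddings; the first student vector is off, the second matches.
student = np.array([[0.0, 0.0], [1.0, 1.0]])
teacher = np.array([[1.0, 1.0], [1.0, 1.0]])
print(mse_loss(student, teacher))  # -> 0.5
```

Minimizing this loss pulls the student's embeddings toward the teacher's in the shared vector space.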
Parameters of the fit()-Method:
{
    "epochs": 15,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "eps": 1e-06,
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 10000,
    "weight_decay": 0.01
}
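The WarmupLinear scheduler raises the learning rate linearly from 0 over the warmup steps, then decays it linearly to 0 over the remaining steps. A sketch of that schedule with the parameters above (warmup_steps=10000, lr=2e-5); the total_steps default is illustrative only (17482 batches x 15 epochs):

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=10000, total_steps=262230):
    """Learning rate at a given step under linear warmup + linear decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear warmup from 0
    # linear decay from base_lr down to 0 after warmup
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(warmup_linear_lr(5000))    # halfway through warmup -> 1e-05
print(warmup_linear_lr(10000))   # warmup complete -> 2e-05
print(warmup_linear_lr(262230))  # end of training -> 0.0
```

Warmup avoids large, destabilizing updates early in training, while the linear decay shrinks the step size as the model converges.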
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
License
This model is licensed under the Apache-2.0 license.
Model Information
| Property | Details |
|----------|---------|
| Model Type | Sentence Transformer |
| Training Data | ted_multi, Helsinki-NLP/tatoeba_mt |
| Metrics | MSE |
| Library Name | sentence-transformers |
Citing & Authors
@misc{Uaritm,
  title={sentence-transformers: Semantic similarity of medical texts},
  author={Vitaliy Ostashko},
  year={2022},
  url={https://aihealth.site},
}