smartmind/roberta-ko-small-tsdae
This is a sentence-transformers model that maps sentences and paragraphs to a 256-dimensional dense vector space, suitable for tasks such as clustering and semantic search. It is a small Korean RoBERTa model pretrained with TSDAE. The model can be used as-is to compute sentence similarity, or fine-tuned for specific needs.
🚀 Quick Start
✨ Features
- Maps sentences and paragraphs to a 256-dimensional dense vector space.
- Can be used for clustering or semantic search.
- Can be directly used for sentence similarity calculation or fine-tuned.
📦 Installation
To use this model, install sentence-transformers:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
Usage with Sentence-Transformers
After installing sentence-transformers, you can directly load the model as follows:
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')
embeddings = model.encode(sentences)
print(embeddings)
```
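`encode` returns one 256-dimensional vector per sentence (a NumPy array by default). As a quick sanity check, here is a minimal sketch that compares two Korean paraphrases with `util.cos_sim`; the two sentences are taken from the example below:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')

# Encode two paraphrases and print their cosine similarity (a 1x1 tensor).
embeddings = model.encode(["대한민국의 수도는 서울입니다.", "서울은 대한민국의 수도입니다."])
print(util.cos_sim(embeddings[0], embeddings[1]))
```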
The following example uses the sentence-transformers utility functions to compute the pairwise similarity of several sentences:
```python
from sentence_transformers import util

sentences = [
    "대한민국의 수도는 서울입니다.",
    "미국의 수도는 뉴욕이 아닙니다.",
    "대한민국의 수도 요금은 저렴한 편입니다.",
    "서울은 대한민국의 수도입니다.",
    "오늘 서울은 하루종일 맑음",
]

paraphrase = util.paraphrase_mining(model, sentences)

for score, i, j in paraphrase:
    print(f"{sentences[i]}\t\t{sentences[j]}\t\t{score:.4f}")
```
```
대한민국의 수도는 서울입니다.		서울은 대한민국의 수도입니다.		0.7616
대한민국의 수도는 서울입니다.		미국의 수도는 뉴욕이 아닙니다.		0.7031
대한민국의 수도는 서울입니다.		대한민국의 수도 요금은 저렴한 편입니다.		0.6594
미국의 수도는 뉴욕이 아닙니다.		서울은 대한민국의 수도입니다.		0.6445
대한민국의 수도 요금은 저렴한 편입니다.		서울은 대한민국의 수도입니다.		0.4915
미국의 수도는 뉴욕이 아닙니다.		대한민국의 수도 요금은 저렴한 편입니다.		0.4785
서울은 대한민국의 수도입니다.		오늘 서울은 하루종일 맑음		0.4119
대한민국의 수도는 서울입니다.		오늘 서울은 하루종일 맑음		0.3520
미국의 수도는 뉴욕이 아닙니다.		오늘 서울은 하루종일 맑음		0.2550
대한민국의 수도 요금은 저렴한 편입니다.		오늘 서울은 하루종일 맑음		0.1896
```
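`paraphrase_mining` returns `(score, i, j)` triples sorted from most to least similar, which is what the loop above prints. The same embeddings also support semantic search. Below is a minimal sketch using `util.semantic_search`; the corpus and the query string are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')

corpus = [
    "대한민국의 수도는 서울입니다.",
    "미국의 수도는 뉴욕이 아닙니다.",
    "오늘 서울은 하루종일 맑음",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Hypothetical query: "Which country's capital is Seoul?"
query_embedding = model.encode("서울은 어느 나라의 수도인가요?", convert_to_tensor=True)

# Each hit is a dict with 'corpus_id' and 'score', ranked by similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit['corpus_id']], f"{hit['score']:.4f}")
```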
Usage without Sentence-Transformers
Without sentence-transformers, you can use the model by passing your input through the transformer and then applying CLS pooling to the token embeddings:
```python
from transformers import AutoTokenizer, AutoModel
import torch


def cls_pooling(model_output, attention_mask):
    # Use the embedding of the first ([CLS]) token as the sentence embedding.
    return model_output[0][:, 0]


sentences = ['This is an example sentence', 'Each sentence is converted']

tokenizer = AutoTokenizer.from_pretrained('smartmind/roberta-ko-small-tsdae')
model = AutoModel.from_pretrained('smartmind/roberta-ko-small-tsdae')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
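Building on the snippet above, the embeddings can be compared in plain PyTorch by L2-normalizing them and taking dot products, which then equal cosine similarities (a minimal sketch):

```python
import torch.nn.functional as F

# After L2 normalization, dot products equal cosine similarities.
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized @ normalized.T)  # 2x2 matrix with ones on the diagonal
```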
📚 Documentation
Evaluation Results
The following scores were measured on the KLUE STS dataset, without any fine-tuning on it.
| Split | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|-------|----------------|-----------------|-------------------|--------------------|-------------------|--------------------|-------------|--------------|
| Train | 0.8735 | 0.8676 | 0.8268 | 0.8357 | 0.8248 | 0.8336 | 0.8449 | 0.8383 |
| Validation | 0.5409 | 0.5349 | 0.4786 | 0.4657 | 0.4775 | 0.4625 | 0.5284 | 0.5252 |
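These are the metrics reported by sentence-transformers' `EmbeddingSimilarityEvaluator`, so comparable numbers can be computed directly. A sketch, assuming the KLUE STS data is loaded from the Hugging Face Hub with the `datasets` library and its 0–5 similarity labels are rescaled to [0, 1] (field names follow the `klue` dataset config):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')

data = load_dataset("klue", "sts", split="validation")

# KLUE STS labels range from 0 to 5; the evaluator expects scores in [0, 1].
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=[ex["sentence1"] for ex in data],
    sentences2=[ex["sentence2"] for ex in data],
    scores=[ex["labels"]["label"] / 5.0 for ex in data],
)
print(evaluator(model))  # return type (float or dict) varies by library version
```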
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 508, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 256, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
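The pooling config above confirms that sentence embeddings come from the CLS token (matching the `cls_pooling` function earlier) and that inputs are truncated at 508 tokens. If needed, the limit can be inspected or lowered via the `max_seq_length` attribute; a short sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')
print(model.max_seq_length)  # 508

# Lowering the limit speeds up encoding at the cost of truncating long inputs.
model.max_seq_length = 128
```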
📄 License
This project is licensed under the MIT license.