albert-small-kor-sbert-v1
This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space, suitable for tasks such as clustering and semantic search. It was built from the albert-small-kor-v1 model using the SentenceBERT (SBERT) approach.
Quick Start
Installation
You can install the necessary library with the following command:
pip install -U sentence-transformers
Usage Examples
Basic Usage
from sentence_transformers import SentenceTransformer

# Sentences to embed.
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from the Hugging Face Hub and encode the sentences.
model = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')
embeddings = model.encode(sentences)
print(embeddings)
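Because all embeddings share one 768-dimensional space, semantic search reduces to a nearest-neighbor lookup over cosine similarity. A minimal sketch using sentence-transformers' util.cos_sim helper (the corpus and query below are illustrative; Korean sentences work the same way):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')

# Illustrative corpus and query, not from the model's training data.
corpus = ["This is an example sentence", "Each sentence is converted"]
query = "An example sentence"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus sentence.
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = int(scores.argmax())
print(f"Best match: {corpus[best]} (score={float(scores[best]):.4f})")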
Advanced Usage
Without sentence-transformers, you can use the model as follows. First, pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
# CLS pooling: use the embedding of the first ([CLS]) token as the sentence embedding.
# attention_mask is unused for CLS pooling but kept for a uniform pooling interface.
def cls_pooling(model_output, attention_mask):
    return model_output[0][:, 0]
sentences = ['This is an example sentence', 'Each sentence is converted']
tokenizer = AutoTokenizer.from_pretrained('bongsoo/albert-small-kor-sbert-v1')
model = AutoModel.from_pretrained('bongsoo/albert-small-kor-sbert-v1')
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings (no gradients needed for inference).
with torch.no_grad():
    model_output = model(**encoded_input)

# Apply CLS pooling to get one vector per sentence.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
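As a sanity check, the manual CLS-pooling path above should closely match what the sentence-transformers pipeline produces, since the model is configured for CLS pooling (see Full Model Architecture below). A hedged comparison, reusing sentences and sentence_embeddings from the previous snippet:

from sentence_transformers import SentenceTransformer
import numpy as np

st_model = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')
st_embeddings = st_model.encode(sentences)

# Small numerical differences are expected; the tolerance here is an assumption.
print(np.allclose(st_embeddings, sentence_embeddings.cpu().numpy(), atol=1e-4))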
Documentation
Evaluation Results
- For performance measurement, the following Korean (kor) and English (en) evaluation corpora were used:
  - Korean: korsts (1,379 sentence pairs) and klue-sts (519 sentence pairs)
  - English: stsb_multi_mt (1,376 sentence pairs) and glue:stsb (1,500 sentence pairs)
- The performance metric is the cosine Spearman correlation (cosine.spearman); a sketch of how it is computed follows the table below.
- Refer to the evaluation code here.
| Model | korsts | klue-sts | glue(stsb) | stsb_multi_mt(en) |
|-------|--------|----------|------------|-------------------|
| distiluse-base-multilingual-cased-v2 | 0.7475 | 0.7855 | 0.8193 | 0.8075 |
| paraphrase-multilingual-mpnet-base-v2 | 0.8201 | 0.7993 | 0.8907 | 0.8682 |
| bongsoo/moco-sentencedistilbertV2.1 | 0.8390 | 0.8767 | 0.8805 | 0.8548 |
| bongsoo/albert-small-kor-sbert-v1 | 0.8305 | 0.8588 | 0.8419 | 0.7965 |
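For reference, the cosine Spearman metric is the Spearman rank correlation between the cosine similarities of the embedded sentence pairs and the gold similarity labels. A minimal sketch with scipy (the sentence pairs and gold scores below are placeholders, not drawn from the evaluation corpora):

from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')

# Placeholder STS-style pairs with gold similarity labels.
pairs = [
    ("A man is playing a guitar", "A person plays an instrument"),
    ("A woman is cooking", "Someone prepares food"),
    ("A dog runs in the park", "The stock market fell today"),
]
gold = [4.0, 3.5, 0.2]

emb1 = model.encode([a for a, _ in pairs], convert_to_tensor=True)
emb2 = model.encode([b for _, b in pairs], convert_to_tensor=True)

# Cosine similarity of each aligned pair, then rank correlation with the gold scores.
cosine_scores = util.cos_sim(emb1, emb2).diagonal().cpu().numpy()
print(spearmanr(cosine_scores, gold).correlation)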
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
Training
The albert-small-kor-v1 model was trained in four sequential stages, sts(10)-distil(10)-nli(3)-sts(10): STS fine-tuning (10 epochs), distillation (10 epochs), NLI training (3 epochs), and a final STS pass (10 epochs).
Common Parameters
- do_lower_case=1, correct_bios=0, pooling_mode=cls
1. STS
- Corpus: korsts (5,749) + kluestsV1.1 (11,668) + stsb_multi_mt (5,749) + mteb/sickr-sts (9,927) + glue stsb (5,749) (Total: 38,842)
- Parameters: lr: 1e-4, eps: 1e-6, warm_step=10%, epochs: 10, train_batch: 32, eval_batch: 64, max_token_len: 72
- Refer to the training code here; a sketch of this stage is shown below.
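The linked training code is not reproduced in this card, but an STS stage with these parameters typically looks like the following sentence-transformers sketch. The training examples are placeholders, the model is loaded from the released checkpoint for illustration (the actual run started from albert-small-kor-v1), and warmup_steps stands in for the warm_step=10% listed above:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')

# Placeholder STS examples; the real stage used korsts, klue-sts, etc. (see above).
train_examples = [
    InputExample(texts=["A man is eating", "A person eats"], label=0.9),
    InputExample(texts=["A dog barks", "The sky is blue"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Regress the cosine similarity of the two embeddings onto the gold score.
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    warmup_steps=100,  # assumption: computed as 10% of total steps in the real run
    optimizer_params={"lr": 1e-4, "eps": 1e-6},
)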
2. Distillation
- Teacher model: paraphrase-multilingual-mpnet-base-v2 (max_token_len: 128)
- Corpus: news_talk_en_ko_train.tsv (English-Korean dialogue-news parallel corpus: 1.38M)
- Parameters: lr: 5e-5, eps: 1e-8, epochs: 10, train_batch: 32, eval/test_batch: 64, max_token_len: 128 (to match the teacher model)
- Refer to the training code here; a sketch of this stage is shown below.
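Cross-lingual distillation of this kind is commonly implemented with sentence-transformers' MSELoss: the student is trained so that both sides of a parallel pair map onto the teacher's embedding of the source sentence. A hedged sketch (the parallel pair is a placeholder for news_talk_en_ko_train.tsv, and the student is loaded from the released checkpoint for illustration):

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

teacher = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
student = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')

# Placeholder English-Korean parallel pair.
en_ko_pairs = [("This is a news sentence", "이것은 뉴스 문장입니다")]

train_examples = []
for en, ko in en_ko_pairs:
    target = teacher.encode(en)  # the teacher embedding is the regression target
    train_examples.append(InputExample(texts=[en], label=target))
    train_examples.append(InputExample(texts=[ko], label=target))

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MSELoss(model=student)

student.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    optimizer_params={"lr": 5e-5, "eps": 1e-8},
)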
3. NLI
- Corpus: Training (967,852): kornli (550,152), kluenli (24,998), glue-mnli (392,702); Evaluation (3,519): korsts (1,500), kluests (519), gluests (1,500)
- Parameters: lr: 3e-5, eps: 1e-8, warm_step=10%, epochs: 3, train/eval_batch: 64, max_token_len: 128
- Refer to the training code here; a sketch of this stage is shown below.
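The card does not state which loss the NLI stage used; the original SentenceBERT recipe trains a 3-way classifier over entailment/contradiction/neutral pairs with SoftmaxLoss, sketched below with placeholder examples (again loading the released checkpoint for illustration):

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')

# Placeholder NLI examples; labels: 0=contradiction, 1=entailment, 2=neutral.
train_examples = [
    InputExample(texts=["A man is eating", "A person eats"], label=1),
    InputExample(texts=["A man is eating", "Nobody is eating"], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

# Classify the relation from the concatenated pair of sentence embeddings.
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    optimizer_params={"lr": 3e-5, "eps": 1e-8},
)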
Technical Details
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 256, 'do_lower_case': True}) with Transformer model: AlbertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
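The same module stack can be assembled by hand from sentence-transformers building blocks, which makes the CLS-pooling configuration explicit. Loading the model by name already gives you this pipeline, so the sketch below is purely illustrative:

from sentence_transformers import SentenceTransformer, models

# ALBERT backbone with the max_seq_length shown above.
word_embedding_model = models.Transformer('bongsoo/albert-small-kor-sbert-v1',
                                          max_seq_length=256)

# CLS-token pooling over the 768-dimensional contextual embeddings.
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode='cls')

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])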
License
No license information provided.
Citing & Authors
bongsoo