
moco-sentencedistilbertV2.1
This is a sentence embedding model that maps sentences and paragraphs into a 768-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.
Pipeline Tag
sentence-similarity
Tags
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- ko
- en
Widget
- Source Sentence: "대한민국의 수도는?" (What is the capital of South Korea?)
- Sentences:
  - "서울특별시는 한국이 정치,경제,문화 중심 도시이다." (Seoul is the political, economic, and cultural center of Korea.)
  - "부산은 대한민국의 제2의 도시이자 최대의 해양 물류 도시이다." (Busan is South Korea's second-largest city and its biggest maritime logistics hub.)
  - "제주도는 대한민국에서 유명한 관광지이다" (Jeju Island is a famous tourist destination in South Korea.)
  - "Seoul is the capital of Korea"
  - "울산광역시는 대한민국 남동부 해안에 있는 광역시이다" (Ulsan is a metropolitan city on the southeastern coast of South Korea.)
Quick Start
This is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
- This model was created by converting the bongsoo/mdistilbertV2.1 MLM model into a SentenceBERT model and then further training it with teacher-student distillation and STS fine-tuning.
- Vocab: 152,537 tokens (32,989 new tokens added to the original 119,548).
Installation
Using Sentence-Transformers
pip install -U sentence-transformers
Using HuggingFace Transformers
pip install transformers[torch]
Usage Examples
Basic Usage (Sentence-Transformers)
from sentence_transformers import SentenceTransformer

sentences = ["서울은 한국이 수도이다", "The capital of Korea is Seoul"]

model = SentenceTransformer('bongsoo/moco-sentencedistilbertV2.1')
embeddings = model.encode(sentences)
print(embeddings)

# Calculate the cosine score with sklearn.
# paired_cosine_distances expects 2-D inputs, e.g. shape (1, 768).
from sklearn.metrics.pairwise import paired_cosine_distances

cosine_scores = 1 - paired_cosine_distances(embeddings[0].reshape(1, -1), embeddings[1].reshape(1, -1))
print(f'*cosine_score:{cosine_scores[0]}')
Outputs
[[ 0.27124503 -0.5836643 0.00736023 ... -0.0038319 0.01802095 -0.09652182]
[ 0.2765149 -0.5754248 0.00788184 ... 0.07659392 -0.07825544 -0.06120609]]
*cosine_score:0.9513546228408813
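As an alternative to sklearn (not part of the original card), the same score can be computed with the cos_sim helper bundled with sentence-transformers. A minimal sketch, continuing from the snippet above:

from sentence_transformers import util

# util.cos_sim accepts the 1-D vectors directly and returns a 1x1 similarity matrix
cosine_score = util.cos_sim(embeddings[0], embeddings[1])
print(f'*cosine_score:{cosine_score.item():.4f}')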
Advanced Usage (HuggingFace Transformers)
Without sentence-transformers, you can use the model as follows: First, pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Sentences we want sentence embeddings for
sentences = ["서울은 한국이 수도이다", "The capital of Korea is Seoul"]
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bongsoo/moco-sentencedistilbertV2.1')
model = AutoModel.from_pretrained('bongsoo/moco-sentencedistilbertV2.1')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
# Calculate the cosine score with sklearn.
# paired_cosine_distances expects 2-D inputs, e.g. shape (1, 768).
from sklearn.metrics.pairwise import paired_cosine_distances

cosine_scores = 1 - paired_cosine_distances(sentence_embeddings[0].reshape(1, -1), sentence_embeddings[1].reshape(1, -1))
print(f'*cosine_score:{cosine_scores[0]}')
Outputs
Sentence embeddings:
tensor([[ 0.2712, -0.5837, 0.0074, ..., -0.0038, 0.0180, -0.0965],
[ 0.2765, -0.5754, 0.0079, ..., 0.0766, -0.0783, -0.0612]])
*cosine_score:0.9513546228408813
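Beyond pairwise scoring, the embeddings can drive the semantic-search use case mentioned above. The sketch below is not part of the original card: it reuses the widget sentences and ranks them against the query with the semantic_search helper from sentence-transformers.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bongsoo/moco-sentencedistilbertV2.1')

# Query and candidates taken from the widget example above
query = "대한민국의 수도는?"  # "What is the capital of South Korea?"
corpus = [
    "서울특별시는 한국이 정치,경제,문화 중심 도시이다.",
    "부산은 대한민국의 제2의 도시이자 최대의 해양 물류 도시이다.",
    "제주도는 대한민국에서 유명한 관광지이다",
    "Seoul is the capital of Korea",
    "울산광역시는 대한민국 남동부 해안에 있는 광역시이다",
]

query_emb = model.encode(query, convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# Rank every candidate by cosine similarity to the query
hits = util.semantic_search(query_emb, corpus_emb, top_k=len(corpus))[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")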
Documentation
Evaluation Results
- For performance measurement, the following Korean (kor) and English (en) evaluation corpora are used:
- Korean: korsts (1,379 sentence pairs) and klue-sts (519 sentence pairs)
- English: stsb_multi_mt (1,376 sentence pairs) and glue:stsb (1,500 sentence pairs)
- The performance metric is cosine.spearman/max, i.e. the maximum Spearman correlation among the cosine, Euclidean, Manhattan, and dot-product similarity measures.
- Refer to the evaluation measurement code here.
| Model | korsts | klue-sts | glue(stsb) | stsb_multi_mt(en) |
|---|---|---|---|---|
| distiluse-base-multilingual-cased-v2 | 0.7475/0.7556 | 0.7855/0.7862 | 0.8193 | 0.8075/0.8168 |
| paraphrase-multilingual-mpnet-base-v2 | 0.8201 | 0.7993 | 0.8907/0.8919 | 0.8682 |
| bongsoo/sentencedistilbertV1.2 | 0.8198/0.8202 | 0.8584/0.8608 | 0.8739/0.8740 | 0.8377/0.8388 |
| bongsoo/moco-sentencedistilbertV2.0 | 0.8124/0.8128 | 0.8470/0.8515 | 0.8773/0.8778 | 0.8371/0.8388 |
| bongsoo/moco-sentencebertV2.0 | 0.8244/0.8277 | 0.8411/0.8478 | 0.8792/0.8796 | 0.8436/0.8456 |
| bongsoo/moco-sentencedistilbertV2.1 | 0.8390/0.8398 | 0.8767/0.8808 | 0.8805/0.8816 | 0.8548 |
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
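The linked measurement code is not reproduced here, but the cosine-Spearman metric itself is straightforward. A minimal sketch, assuming an STS-style TSV file (hypothetical name sts-test.tsv) with two sentences and a gold similarity score per line:

import csv
from scipy.stats import spearmanr
from sklearn.metrics.pairwise import paired_cosine_distances
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bongsoo/moco-sentencedistilbertV2.1')

# Hypothetical STS-style TSV: sentence1 <tab> sentence2 <tab> gold score
sents1, sents2, gold = [], [], []
with open('sts-test.tsv', encoding='utf-8') as f:
    for s1, s2, score in csv.reader(f, delimiter='\t'):
        sents1.append(s1)
        sents2.append(s2)
        gold.append(float(score))

# Cosine similarity per pair, then Spearman rank correlation against the gold scores
cosine_scores = 1 - paired_cosine_distances(model.encode(sents1), model.encode(sents2))
corr, _ = spearmanr(gold, cosine_scores)
print(f'cosine-Spearman: {corr:.4f}')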
Training
The model was trained with the following parameters:
1. MLM Training
- Input Model: distilbert-base-multilingual-cased
- Corpus: Training - bongsoo/moco-corpus-kowiki2022 (7.6M), Evaluation - bongsoo/bongevalsmall
- Hyperparameters: Learning Rate: 5e-5, Epochs: 8, Batch Size: 32, Max Token Length: 128
- Vocab: 152,537 tokens (32,989 new tokens added to the original 119,548)
- Output Model: mdistilbertV2.1 (Size: 643MB)
- Training Time: 63 hours on 1 GPU (23.9 GB of 24 GB VRAM used)
- Evaluation: training loss 2.203400, evaluation loss 2.972835, perplexity 23.43 (bong_eval: 1,500)
- Refer to the training code here.
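The linked training code is not shown here; the sketch below only illustrates the shape of this step (vocab expansion plus continued MLM training) with the Hugging Face Trainer. The two-sentence corpus and the added tokens are toy stand-ins for bongsoo/moco-corpus-kowiki2022 and the 32,989 real vocab entries.

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = 'distilbert-base-multilingual-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Vocab expansion: register new word pieces, then resize the embedding matrix
# (two toy tokens stand in for the 32,989 entries added in the real run)
tokenizer.add_tokens(['예시토큰1', '예시토큰2'])
model.resize_token_embeddings(len(tokenizer))

# Toy stand-in for the 7.6M-sentence training corpus
corpus = Dataset.from_dict({'text': ['서울은 한국의 수도이다.', 'Seoul is the capital of Korea.']})
tokenized = corpus.map(lambda b: tokenizer(b['text'], truncation=True, max_length=128),
                       batched=True, remove_columns=['text'])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir='mdistilbertV2.1', learning_rate=5e-5,
                         num_train_epochs=8, per_device_train_batch_size=32)
trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=tokenized)
trainer.train()
trainer.save_model('mdistilbertV2.1')
tokenizer.save_pretrained('mdistilbertV2.1')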
2. STS Training
- Converts the MLM checkpoint into a SentenceBERT model and fine-tunes it on STS data.
- Input Model: mdistilbertV2.1 (Size: 643MB)
- Corpus: korsts (5,749) + kluestsV1.1 (11,668) + stsb_multi_mt (5,749) + mteb/sickr-sts (9,927) + glue stsb (5,749) (Total: 38,842)
- Hyperparameters: Learning Rate: 3e-5, Epochs: 800, Batch Size: 128, Max Token Length: 256
- Output Model: sbert-mdistilbertV2.1 (Size: 640MB)
- Training Time: 13 hours on 1 GPU (16.1 GB of 24 GB VRAM used)
- Evaluation (cosine Spearman): 0.790 (Corpus: korsts(tune_test.tsv))
- Refer to the training code here.
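A minimal sketch of this conversion-plus-STS step using the sentence-transformers training API. The two training pairs are toy stand-ins for the ~38K-pair corpus mixture listed above; bongsoo/mdistilbertV2.1 is the MLM checkpoint referenced in the Quick Start.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, InputExample, losses

# Convert the MLM checkpoint into a SentenceBERT model: transformer + mean pooling
word_model = models.Transformer('bongsoo/mdistilbertV2.1', max_seq_length=256)
pooling = models.Pooling(word_model.get_word_embedding_dimension(), pooling_mode='mean')
model = SentenceTransformer(modules=[word_model, pooling])

# Toy STS pairs; gold similarity scores are scaled to the [0, 1] range
train_examples = [
    InputExample(texts=['서울은 한국의 수도이다.', 'Seoul is the capital of Korea.'], label=1.0),
    InputExample(texts=['서울은 한국의 수도이다.', '부산은 항구 도시이다.'], label=0.2),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=128)
loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=800, warmup_steps=100,
          optimizer_params={'lr': 3e-5}, output_path='sbert-mdistilbertV2.1')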
3. Distillation Training
- Student Model: sbert-mdistilbertV2.1
- Teacher Model: paraphrase-multilingual-mpnet-base-v2 (max_token_len: 128)
- Corpus: news_talk_en_ko_train.tsv (English-Korean dialogue-news parallel corpus: 1.38M)
- Hyperparameters: Learning Rate: 5e-5, Epochs: 40, Batch Size: 128, Max Token Length: 128 (to match the teacher model)
- Output Model: sbert-mdistilbertV2.1-distil
- Training Time: 17 hours on 1 GPU (9 GB of 24 GB VRAM used)
- Refer to the training code here.
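A minimal sketch of the teacher-student distillation step with sentence-transformers: the student is trained (via MSELoss) to reproduce the teacher's embedding of the English source for both sides of each translation pair. The single pair below is a toy stand-in for news_talk_en_ko_train.tsv, and 'sbert-mdistilbertV2.1' refers to the output of the previous step.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

student = SentenceTransformer('sbert-mdistilbertV2.1')            # output of step 2 (local path)
teacher = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
student.max_seq_length = 128                                      # match the teacher's max token length

# Toy stand-in for the English-Korean parallel corpus
pairs = [('The capital of Korea is Seoul.', '한국의 수도는 서울이다.')]

train_examples = []
for en, ko in pairs:
    target = teacher.encode(en)                                   # teacher embedding as regression target
    train_examples.append(InputExample(texts=[en], label=target))
    train_examples.append(InputExample(texts=[ko], label=target))

loader = DataLoader(train_examples, shuffle=True, batch_size=128)
loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(loader, loss)], epochs=40,
            optimizer_params={'lr': 5e-5}, output_path='sbert-mdistilbertV2.1-distil')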
4. STS Training
- Input Model: sbert-mdistilbertV2.1-distil
- Corpus: korsts (5,749) + kluestsV1.1 (11,668) + stsb_multi_mt (5,749) + mteb/sickr-sts (9,927) + glue stsb (5,749) (Total: 38,842)
- Hyperparameters: Learning Rate: 3e-5, Epochs: 1200, Batch Size: 128, Max Token Length: 256
- Output Model: moco-sentencedistilbertV2.1
- Training Time: 12 hours on 1 GPU (16.1 GB of 24 GB VRAM used)
- Evaluation (cosine Spearman): 0.839 (Corpus: korsts(tune_test.tsv))
- Refer to the training code here.
For more details about the model creation process, see here.
Config
{
"_name_or_path": "../../data11/model/sbert/sbert-mdistilbertV2.1-distil",
"activation": "gelu",
"architectures": [
"DistilBertModel"
],
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"hidden_dim": 3072,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"model_type": "distilbert",
"n_heads": 12,
"n_layers": 6,
"output_past": true,
"pad_token_id": 0,
"qa_dropout": 0.1,
"seq_classif_dropout": 0.2,
"sinusoidal_pos_embds": false,
"tie_weights_": true,
"torch_dtype": "float32",
"transformers_version": "4.21.2",
"vocab_size": 152537
}
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: DistilBertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
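A quick way to confirm this module stack and the key figures above (not part of the original card):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bongsoo/moco-sentencedistilbertV2.1')
print(model)                                        # Transformer + mean Pooling, as listed above
print(model.get_sentence_embedding_dimension())     # 768
print(model.max_seq_length)                         # 256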
Tokenizer Config
{
"cls_token": "[CLS]",
"do_basic_tokenize": true,
"do_lower_case": false,
"mask_token": "[MASK]",
"max_len": 128,
"name_or_path": "../../data11/model/sbert/sbert-mdistilbertV2.1-distil",
"never_split": null,
"pad_token": "[PAD]",
"sep_token": "[SEP]",
"special_tokens_map_file": "../../data11/model/distilbert/mdistilbertV2.1-4/special_tokens_map.json",
"strip_accents": false,
"tokenize_chinese_chars": true,
"tokenizer_class": "DistilBertTokenizer",
"unk_token": "[UNK]"
}
Sentence Bert Config
{
"max_seq_length": 256,
"do_lower_case": false
}
Config Sentence Transformers
{
"__version__": {
"sentence_transformers": "2.2.0",
"transformers": "4.21.2",
"pytorch": "1.10.1"
}
}
License
No license information provided.
Citing & Authors
bongsoo





