moco-sentencebertV2.0 Open-Source Model - Korean-English Semantic Similarity and Text Feature Extraction

moco-sentencebertV2.0

Developed by bongsoo

A sentence embedding model optimized for Korean and English, supporting semantic similarity calculation and text feature extraction

Text Embedding

Transformers

Supports Multiple Languages #Korean-English Bilingual Semantic Matching #Teacher-Student Distillation Optimization #Multi-domain STS Adaptation

Downloads 17

Release Time: 9/19/2022

Model Overview

This is a sentence embedding model built on multilingual BERT and refined through teacher-student distillation training. It is suited to Korean and English sentence similarity calculation, semantic search, and text clustering.

Model Features

Bilingual Optimization

Tuned for Korean and English semantic similarity. In the evaluation results below it reaches a cosine Spearman score of 0.824 on korsts and 0.879 on glue (stsb)

Knowledge Distillation

Uses paraphrase-multilingual-mpnet-base-v2 as the teacher model for distillation training to enhance model performance

Extended Vocabulary

Adds 32,989 new tokens to the original multilingual BERT vocabulary of 119,548 tokens, for a total of 152,537 tokens

Efficient Inference

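Truncates input to a maximum sequence length of 128 tokens. The roughly 9GB GPU memory figure comes from the first STS training stage (9.0GB used on a 24GB GPU), not from a separate inference measurement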

Model Capabilities

Sentence embedding generation

Semantic similarity calculation

Text feature extraction

Cross-language semantic matching

Use Cases

Information Retrieval

🚀 moco-sentencebertV2.0

This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It can be used for tasks such as clustering or semantic search.
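To make the semantic search use case concrete, here is a minimal sketch in plain Python that ranks a small corpus against a query by cosine similarity. The 3-dimensional vectors are toy stand-ins for the model's 768-dimensional embeddings, and the helper names `cosine_similarity` and `semantic_search` are illustrative, not part of sentence-transformers.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, corpus_vecs, top_k=2):
    """Return (corpus index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for embeddings (hypothetical values, not model output).
corpus = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.1, 0.0]

hits = semantic_search(query, corpus, top_k=2)
print(hits)  # corpus entry 0 ranks first, entry 1 second
```

With real data, the vectors would come from `model.encode(...)` as in the usage examples; the ranking logic stays the same.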

✨ Features

This model was created by converting the bongsoo/mbertV2.0 MLM model into SentenceBERT and then performing additional STS and teacher-student distillation training.
Vocab: 152,537 tokens (32,989 new tokens added to the original 119,548).

📦 Installation

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence_transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('bongsoo/moco-sentencebertV2.0')
embeddings = model.encode(sentences)
print(embeddings)

# Compute the cosine score with sklearn.
# => The input embeddings must be 2D, e.g. shape (1, 768).
from sklearn.metrics.pairwise import paired_cosine_distances
cosine_scores = 1 - paired_cosine_distances(embeddings[0].reshape(1, -1), embeddings[1].reshape(1, -1))

print(f'*cosine_score:{cosine_scores[0]}')

Outputs

[[ 0.16649279 -0.2933038  -0.00391259 ...  0.00720964  0.18175027  -0.21052675]
 [ 0.10106096 -0.11454111 -0.00378215 ... -0.009032   -0.2111504   -0.15030429]]
*cosine_score:0.3352515697479248

Advanced Usage

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bongsoo/moco-sentencebertV2.0')
model = AutoModel.from_pretrained('bongsoo/moco-sentencebertV2.0')

# Tokenize sentences (max_length=128 matches the model's max_seq_length)
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)

# Compute the cosine score with sklearn.
# => The input embeddings must be 2D, e.g. shape (1, 768).
from sklearn.metrics.pairwise import paired_cosine_distances
cosine_scores = 1 - paired_cosine_distances(sentence_embeddings[0].reshape(1, -1), sentence_embeddings[1].reshape(1, -1))

print(f'*cosine_score:{cosine_scores[0]}')

Outputs

Sentence embeddings:
tensor([[ 0.1665, -0.2933, -0.0039,  ...,  0.0072,  0.1818, -0.2105],
        [ 0.1011, -0.1145, -0.0038,  ..., -0.0090, -0.2112, -0.1503]])
*cosine_score:0.3352515697479248
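The `mean_pooling` function above averages token vectors while ignoring padding. The following dependency-free sketch mirrors that logic on nested Python lists so the effect of the attention mask is easy to see. It is an illustration of the same idea, not the code the model runs, and the toy values are made up.

```python
def mean_pooling(token_embeddings, attention_mask):
    """Average token vectors per sentence, skipping positions where mask == 0."""
    pooled = []
    for tokens, mask in zip(token_embeddings, attention_mask):
        dim = len(tokens[0])
        summed = [0.0] * dim
        for vec, m in zip(tokens, mask):
            for d in range(dim):
                summed[d] += vec[d] * m  # padded positions contribute 0
        count = max(sum(mask), 1e-9)  # same guard as torch.clamp(..., min=1e-9)
        pooled.append([s / count for s in summed])
    return pooled


# One sentence with three token slots; the last slot is padding.
tokens = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
mask = [[1, 1, 0]]

pooled = mean_pooling(tokens, mask)
print(pooled)  # [[2.0, 3.0]]
```

Without the mask, the padding vector would pull the average toward 100, which is why the attention mask is passed into the pooling step.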

📚 Documentation

Evaluation Results

Performance is measured on the following Korean (kor) and English (en) evaluation corpora:
- Korean: korsts (1,379 sentence pairs) and klue-sts (519 sentence pairs).
- English: stsb_multi_mt (1,376 sentence pairs) and glue:stsb (1,500 sentence pairs).
The comparison metric is cosine Spearman (cosin.spearman): the Spearman rank correlation between the cosine similarity scores and the gold labels.
Refer to the evaluation measurement code here.

Model	korsts	klue-sts	korsts+klue-sts	stsb_multi_mt	glue(stsb)
distiluse-base-multilingual-cased-v2	0.747	0.785	0.577	0.807	0.819
paraphrase-multilingual-mpnet-base-v2	0.820	0.799	0.711	0.868	0.890
bongsoo/sentencedistilbertV1.2	0.819	0.858	0.630	0.837	0.873
bongsoo/moco-sentencedistilbertV2.0	0.812	0.847	0.627	0.837	0.877
bongsoo/moco-sentencebertV2.0	0.824	0.841	0.635	0.843	0.879
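The scores in the table are Spearman rank correlations between predicted cosine similarities and gold similarity labels. The evaluation code itself is not reproduced here, so the following is only a sketch of the metric in plain Python; the sample scores and labels are invented for illustration. In practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical cosine scores and gold labels for four sentence pairs.
predicted = [0.9, 0.2, 0.6, 0.4]
gold = [4.8, 0.5, 3.0, 3.5]
score = spearman(predicted, gold)
print(round(score, 3))  # approximately 0.8
```

Because only the ranking matters, the metric is unaffected by the different scales of cosine scores and gold labels.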

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the following parameters:

1. MLM Training

Input model: bert-base-multilingual-cased
Corpus: training on bongsoo/moco-corpus-kowiki2022 (7.6M), evaluation on bongsoo/bongevalsmall
Hyperparameters: learning rate 5e-5, epochs 8, batch size 32, max_token_len 128
Output model: mbertV2.0 (size: 813MB)
Training time: 90 hours on 1 GPU (19.6GB of 24GB memory used)
Loss: training loss 2.258400, evaluation loss 3.102096, perplexity 19.78158 (bong_eval: 1,500)
Refer to the training code here.

2. STS Training => Convert BERT to SentenceBERT.

Input model: mbertV2.0
Corpus: korsts + kluestsV1.1 + stsb_multi_mt + mteb/sickr-sts (total: 33,093)
Hyperparameters: learning rate 3e-5, epochs 200, batch size 32, max_token_len 128
Output model: sbert-mbertV2.0 (size: 813MB)
Training time: 9 hours 20 minutes on 1 GPU (9.0GB of 24GB memory used)
Score (cosine Spearman): 0.799 on korsts (tune_test.tsv)
Refer to the training code here.

3. Distillation Training

Student model: sbert-mbertV2.0
Teacher model: paraphrase-multilingual-mpnet-base-v2
Corpus: en_ko_train.tsv (Korean-English social science parallel corpus: 1.1M)
Hyperparameters: learning rate 5e-5, epochs 40, batch size 128, max_token_len 128
Output model: sbert-mlbertV2.0-distil
Training time: 17 hours on 1 GPU (18.6GB of 24GB memory used)
Refer to the training code here.
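The model card does not spell out the distillation objective. One common formulation for teacher-student training on a parallel corpus is a mean squared error loss that pulls the student's embeddings of both the source sentence and its translation toward the teacher's embedding of the source sentence. The sketch below illustrates that idea under this assumption; the function names, the choice of English as the teacher-side language, and the toy vectors are all hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_en, student_en, student_ko):
    """Assumed objective: both student embeddings should match the teacher's.

    teacher_en: teacher embedding of the English sentence
    student_en: student embedding of the same English sentence
    student_ko: student embedding of the Korean translation
    """
    return mse(student_en, teacher_en) + mse(student_ko, teacher_en)


# Toy 2-dimensional embeddings (hypothetical values).
teacher_en = [1.0, 0.0]
student_en = [1.0, 0.0]  # already matches the teacher
student_ko = [0.0, 0.0]  # still far from the teacher

loss = distillation_loss(teacher_en, student_en, student_ko)
print(loss)  # 0.5
```

Minimizing such a loss would move the Korean embedding toward the teacher's vector, which is what lets translated sentence pairs land close together in the shared space.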

4. STS Training => Train the SentenceBERT model with STS.

Input model: sbert-mlbertV2.0-distil
Corpus: korsts (5,749) + kluestsV1.1 (11,668) + stsb_multi_mt (5,749) + mteb/sickr-sts (9,927) + glue stsb (5,749) (total: 38,842)
Hyperparameters: learning rate 3e-5, epochs 800, batch size 64, max_token_len 128
Output model: moco-sentencebertV2.0
Training time: 25 hours on 1 GPU (13GB of 24GB memory used)
Refer to the training code here.

Refer to the detailed content of the model production process here.

DataLoader: torch.utils.data.dataloader.DataLoader of length 1035 with parameters:

{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
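The corpus totals and the DataLoader length reported above can be cross-checked with simple arithmetic. This sketch assumes that the datasets in training step 2 have the same sizes as listed in step 4, and that the DataLoader keeps the final partial batch (the PyTorch default, `drop_last=False`). Under those assumptions the length of 1035 matches step 2 (33,093 pairs, batch size 32).

```python
import math

# Per-dataset sizes as listed for training step 4.
stage4_corpus = {
    "korsts": 5749,
    "kluestsV1.1": 11668,
    "stsb_multi_mt": 5749,
    "mteb/sickr-sts": 9927,
    "glue stsb": 5749,
}

stage4_total = sum(stage4_corpus.values())

# Step 2 lists the same datasets without glue stsb (assumed same sizes).
stage2_total = stage4_total - stage4_corpus["glue stsb"]

# Number of batches when the last partial batch is kept.
batches = math.ceil(stage2_total / 32)

print(stage4_total, stage2_total, batches)  # 38842 33093 1035
```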

Config:

{
  "_name_or_path": "../../data11/model/sbert/sbert-mbertV2.0-distil",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.21.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 152537
}

🔧 Technical Details

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

📄 License

No license information provided in the original document.

Citing & Authors

bongsoo
