moco-sentencedistilbertV2.0: Open-source Model for Korean-English Bilingual Semantic Search and Clustering

Developed by bongsoo

This is a Korean-English bilingual sentence embedding model built with sentence-transformers. It maps sentences to a 768-dimensional vector space and is suited to semantic search and clustering tasks.

Text Embedding

Transformers

Tags: Supports Multiple Languages, Korean-English Bilingual, Sentence Similarity, Semantic Search

Downloads 39

Release Date: 9/5/2022

Model Overview

This model builds on mdistilbertV1.1: it was trained on the 3.2M-sentence moco-corpus and then further trained with STS teacher-student distillation. It supports sentence similarity calculation in Korean and English.

Model Features

Bilingual Support

Supports sentence embeddings for both Korean and English

Efficient Distillation

Improves model performance through teacher-student distillation training

Large-scale Training

Trained on a 3.2M-sentence moco-corpus

Optimized Vocabulary

Vocabulary expanded to 164,314 entries, 17,870 more than the original mdistilbertV1.1 vocabulary of 146,444

Model Capabilities

Sentence Embedding

Semantic Similarity Calculation

Text Clustering

Cross-language Retrieval

Use Cases

Information Retrieval

Cross-language Document Retrieval

Finding semantically similar documents in mixed Korean and English document libraries

Effectively identifies semantically similar documents across different languages

Q&A Systems

Question Matching

Matching user questions with similar questions in the knowledge base

As shown in the widget example, the model identifies the semantic similarity between the question 'What is the capital of Korea?' and the statement 'Seoul is the capital of Korea'

Content Recommendation

🚀 moco-sentencedistilbertV2.0

This is a model that maps sentences & paragraphs to a 768-dimensional dense vector space, suitable for tasks like clustering or semantic search.

Pipeline and Tags

Pipeline Tag: sentence-similarity
Tags: sentence-transformers, feature-extraction, sentence-similarity, transformers, ko, en

Widget Example

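Source Sentence: "대한민국의 수도는?" (What is the capital of South Korea?)
Comparison Sentences:
- "서울특별시는 한국이 정치,경제,문화 중심 도시이다." (Seoul Special City is the political, economic, and cultural center of Korea.)
- "부산은 대한민국의 제2의 도시이자 최대의 해양 물류 도시이다." (Busan is South Korea's second city and its largest maritime logistics city.)
- "제주도는 대한민국에서 유명한 관광지이다" (Jeju Island is a famous tourist destination in South Korea.)
- "Seoul is the capital of Korea"
- "울산광역시는 대한민국 남동부 해안에 있는 광역시이다" (Ulsan Metropolitan City is a metropolitan city on the southeastern coast of South Korea.)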
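
The comparison in the widget amounts to ranking candidate sentences by cosine similarity to the source sentence. The following is a minimal sketch of that ranking, not the widget's actual implementation. The helper names `cosine`, `rank_candidates`, and `embed` are illustrative and do not come from the model card. The toy 3-dimensional vectors stand in for real embeddings, and `embed` is only needed for real sentences (it assumes sentence-transformers is installed and the model can be downloaded).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidate_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(query_vec, vec)) for i, vec in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def embed(sentences):
    """Hypothetical helper: encode sentences with the model (not called below)."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('bongsoo/moco-sentencedistilbertV2.0')
    return model.encode(sentences).tolist()


# Toy vectors standing in for 768-dimensional sentence embeddings.
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]

ranking = rank_candidates(query, candidates)
print(ranking[0][0])  # index of the most similar candidate
```

With real data, `embed` would be called once on the source sentence and once on the comparison sentences, and the resulting vectors passed to `rank_candidates`.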

🚀 Quick Start

This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.

This model was created by first training mdistilbertV1.1 on the moco-corpus (3.2M sentences extracted by MOCOMSYS) using SentenceBERT, and then applying additional STS and teacher-student distillation training.
Vocabulary: 164,314 entries (17,870 added to the original mdistilbertV1.1 vocabulary of 146,444).
MLM Model: bongsoo/mdistilbertV2.0

📦 Installation

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence_transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('bongsoo/moco-sentencedistilbertV2.0')
embeddings = model.encode(sentences)
print(embeddings)

# Cosine similarity = 1 - cosine distance.
# sklearn's paired_* functions expect 2D inputs, so reshape each vector to (1, 768).
from sklearn.metrics.pairwise import paired_cosine_distances
cosine_scores = 1 - paired_cosine_distances(embeddings[0].reshape(1, -1), embeddings[1].reshape(1, -1))

print(f'*cosine_score:{cosine_scores[0]}')

Outputs

[[ 9.7172342e-02 -3.3226651e-01 -7.7130608e-05 ...  1.3900512e-02 2.1072578e-01 -1.5386048e-01]
 [ 2.3313640e-02 -8.4675789e-02 -3.7715461e-06 ...  2.4005771e-02 -1.6602692e-01 -1.2729791e-01]]
*cosine_score:0.3383665680885315

Advanced Usage

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bongsoo/moco-sentencedistilbertV2.0')
model = AutoModel.from_pretrained('bongsoo/moco-sentencedistilbertV2.0')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)

# Cosine similarity = 1 - cosine distance.
# sklearn's paired_* functions expect 2D inputs, so convert the tensor to NumPy
# and reshape each vector to (1, 768).
from sklearn.metrics.pairwise import paired_cosine_distances
embeddings_np = sentence_embeddings.numpy()
cosine_scores = 1 - paired_cosine_distances(embeddings_np[0].reshape(1, -1), embeddings_np[1].reshape(1, -1))

print(f'*cosine_score:{cosine_scores[0]}')

Outputs

Sentence embeddings:
tensor([[ 9.7172e-02, -3.3227e-01, -7.7131e-05,  ...,  1.3901e-02, 2.1073e-01, -1.5386e-01],
        [ 2.3314e-02, -8.4676e-02, -3.7715e-06,  ...,  2.4006e-02, -1.6603e-01, -1.2730e-01]])
*cosine_score:0.3383665680885315
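
The `mean_pooling` function above averages token embeddings while ignoring padding positions. The following plain-Python sketch illustrates why the attention mask matters. The function name `masked_mean` and the toy values are illustrative only, and the exaggerated padding vector is an assumption chosen to make the effect visible.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + x for t, x in zip(total, vector)]
            count += 1
    # Guard against an all-padding input, mirroring the clamp in mean_pooling.
    return [t / max(count, 1e-9) for t in total]


# One sentence with two real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
naive = [sum(column) / len(tokens) for column in zip(*tokens)]

print(pooled)  # [2.0, 3.0]: the padding vector does not contribute
print(naive)   # a plain average is skewed by the padding vector
```

Because sentences in a batch are padded to a common length, an unmasked average would make a sentence's embedding depend on how much padding it received.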

📚 Documentation

Evaluation Results

Performance was measured on the following Korean and English evaluation corpora:
- Korean: korsts (1,379 sentence pairs) and klue-sts (519 sentence pairs)
- English: stsb_multi_mt (1,376 sentence pairs)

The reported metric is the Spearman correlation of cosine similarities (cosine Spearman).
Refer to the evaluation code here.

Model	korsts	klue-sts	korsts+klue-sts	stsb_multi_mt
bongsoo/sentencedistilbertV1.2	0.819	0.858	0.630	0.837
distiluse-base-multilingual-cased-v2	0.747	0.785	0.577	0.807
paraphrase-multilingual-mpnet-base-v2	0.820	0.799	0.711	0.868
bongsoo/moco-sentencedistilbertV2.0	0.812	0.847	0.627	0.837
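
The scores in the table are Spearman rank correlations between the model's cosine similarities and human similarity labels. The sketch below shows how such a score can be computed in plain Python. It is not the author's evaluation code, and the sample values are hypothetical. In practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: model cosine scores and human similarity labels (0-5).
predicted = [0.91, 0.15, 0.62, 0.40]
gold = [4.8, 0.5, 2.0, 3.1]

score = spearman(predicted, gold)
print(round(score, 3))  # 0.8
```

Because only the ranks matter, the metric rewards a model for ordering sentence pairs the same way humans do, regardless of the absolute scale of its cosine scores.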

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the following parameters:

1. MLM Training

Input Model: bongsoo/mdistilbertV1.1 (distilbert-base-multilingual-cased trained on the kowiki20220620 (4.4M) corpus)
Corpus: nlp_corpus (3.2M): A refined corpus from MOCOMSYS files
Hyperparameters: Learning Rate: 5e-5, Epochs: 8, Batch Size: 32, Max Token Length: 128
Output Model: mdistilbertV2.0
Training Time: 27h
Refer to the training code here

2. STS Training

Convert DistilBERT to SentenceBERT.
Input Model: mdistilbertV2.0
Corpus: korsts + kluestsV1.1 + stsb_multi_mt + mteb/sickr-sts (Total: 33,093)
Hyperparameters: Learning Rate: 2e-5, Epochs: 200, Batch Size: 32, Max Token Length: 128
Output Model: sbert-mdistilbertV2.0
Training Time: 5h
Refer to the training code here

3. Distillation Training

Student Model: sbert-mdistilbertV2.0
Teacher Model: paraphrase-multilingual-mpnet-base-v2
Corpus: en_ko_train.tsv (A parallel corpus of Korean-English social science fields: 1.1M)
Hyperparameters: Learning Rate: 5e-5, Epochs: 40, Batch Size: 32, Max Token Length: 128
Output Model: sbert-mdistilbertV2.0.2-distil
Training Time: 11h
Refer to the training code here
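
The model card does not state the loss used in this step. A common setup for teacher-student training on a parallel corpus, assumed here and not confirmed by the source, has the student reproduce the teacher's embedding of the English sentence for both the English sentence and its Korean translation, using mean squared error. The function names and toy vectors below are illustrative.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def distillation_loss(teacher_en, student_en, student_ko):
    """Assumed objective: pull both student vectors toward the teacher vector."""
    return mse(student_en, teacher_en) + mse(student_ko, teacher_en)


# Toy 2-dimensional vectors standing in for real sentence embeddings.
teacher_en = [1.0, 0.0]  # teacher embedding of the English sentence
student_en = [0.8, 0.2]  # student embedding of the same English sentence
student_ko = [0.5, 0.5]  # student embedding of the Korean translation

loss = distillation_loss(teacher_en, student_en, student_ko)
print(loss)
```

Under this objective, a sentence and its translation are pushed toward the same point in the teacher's vector space, which is what makes cross-language retrieval possible.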

4. STS Training

Train the SentenceBERT model on STS.
Input Model: sbert-mdistilbertV2.0.2-distil
Corpus: korsts + kluestsV1.1 + stsb_multi_mt + mteb/sickr-sts (Total: 33,093)
Hyperparameters: Learning Rate: 3e-5, Epochs: 800, Batch Size: 32, Max Token Length: 128
Output Model: moco-sentencedistilbertV2.0
Training Time: 15h
Refer to the training code here

For more details about the model creation process, refer to here.

DataLoader:

torch.utils.data.dataloader.DataLoader of length 1035 with parameters:

{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
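
The DataLoader length of 1035 is consistent with the 33,093-pair STS corpus and a batch size of 32, assuming the last partial batch is kept. The short check below also derives the total number of optimizer steps under the further assumption that this DataLoader belongs to the final STS step (800 epochs).

```python
import math

corpus_size = 33_093  # total STS training pairs listed for the STS steps
batch_size = 32       # from the DataLoader parameters
epochs = 800          # from the final STS training step (assumed to match)

batches_per_epoch = math.ceil(corpus_size / batch_size)
total_steps = batches_per_epoch * epochs

print(batches_per_epoch)  # 1035
print(total_steps)        # 828000
```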

Config:

{
  "_name_or_path": "../../data11/model/sbert/sbert-mdistilbertV2.0.2-distil",
  "activation": "gelu",
  "architectures": [
    "DistilBertModel"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.21.2",
  "vocab_size": 164314
}
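
A few derived quantities can be read off this config. The sketch below uses a subset of the fields and is only an illustration. The original vocabulary size of 146,444 comes from the Quick Start notes, and the embedding-matrix count covers only the word embeddings, not the full model.

```python
import json

# Subset of the config shown above.
config = json.loads("""
{
  "dim": 768,
  "hidden_dim": 3072,
  "n_heads": 12,
  "n_layers": 6,
  "max_position_embeddings": 512,
  "vocab_size": 164314
}
""")

head_dim = config["dim"] // config["n_heads"]
word_embedding_params = config["vocab_size"] * config["dim"]
added_tokens = config["vocab_size"] - 146_444  # original vocab size per the Quick Start notes

print(head_dim)               # 64
print(word_embedding_params)  # 126193152
print(added_tokens)           # 17870
```

The expanded vocabulary makes the word-embedding matrix a large share of the model, since the transformer itself has only 6 layers.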

🔧 Technical Details

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
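
The printout above describes a two-module pipeline: a DistilBERT transformer followed by mean pooling. The sketch below shows how a pipeline with the same settings could be assembled with the sentence-transformers `models` API. It is not the author's training code. The function `build_pipeline` is illustrative, and its default checkpoint assumes the MLM model named in the Quick Start notes is available on the Hub.

```python
# Settings copied from the module printout above.
TRANSFORMER_KWARGS = {"max_seq_length": 128, "do_lower_case": False}
POOLING_KWARGS = {
    "pooling_mode_cls_token": False,
    "pooling_mode_mean_tokens": True,
    "pooling_mode_max_tokens": False,
    "pooling_mode_mean_sqrt_len_tokens": False,
}


def build_pipeline(model_name="bongsoo/mdistilbertV2.0"):
    """Assemble Transformer + mean Pooling (needs sentence-transformers installed)."""
    from sentence_transformers import SentenceTransformer, models

    word_embedding_model = models.Transformer(model_name, **TRANSFORMER_KWARGS)
    pooling_model = models.Pooling(
        word_embedding_model.get_word_embedding_dimension(), **POOLING_KWARGS
    )
    return SentenceTransformer(modules=[word_embedding_model, pooling_model])


# Exactly one pooling mode is enabled: the mean over tokens.
enabled = [name for name, on in POOLING_KWARGS.items() if on]
print(enabled)
```

Note that `max_seq_length` is 128 even though the config allows 512 positions, so longer inputs are truncated to 128 tokens before pooling.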

📄 License

This model was developed by bongsoo.
