ko-sbert-sts: Open-Source Korean Sentence Embedding Model for Clustering and Semantic Search

Developed by jhgan

This is a Korean sentence embedding model based on sentence-transformers, capable of mapping sentences and paragraphs into a 768-dimensional dense vector space, suitable for tasks such as clustering or semantic search.

Tags: Text Embedding, Korean sentence embedding, Semantic similarity calculation, 768-dimensional vector space

Downloads 175.93k

Release date: 3/2/2022

Model Overview

This model is specifically designed for Korean text and can convert sentences and paragraphs into high-dimensional vector representations, suitable for natural language processing tasks such as sentence similarity calculation, semantic search, and text clustering.

Model Features

Korean language optimization

Trained on Korean data (the KorSTS training set), so the embeddings are tuned to the semantics of Korean sentences.

High-dimensional vector representation

Maps text into a 768-dimensional dense vector space, preserving rich semantic information.

Sentence similarity calculation

Particularly suitable for calculating semantic similarity between sentences.
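
Semantic similarity between two embeddings is usually measured with cosine similarity. The following is a minimal sketch of that calculation in plain Python; the 4-dimensional vectors are toy stand-ins for the model's 768-dimensional output, and the function name is illustrative rather than part of any library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy 4-dimensional stand-ins for the model's 768-dimensional embeddings.
emb_a = [1.0, 0.0, 1.0, 0.0]
emb_b = [1.0, 0.0, 1.0, 0.0]
emb_c = [0.0, 1.0, 0.0, 1.0]

same = cosine_similarity(emb_a, emb_b)       # identical direction, close to 1
different = cosine_similarity(emb_a, emb_c)  # orthogonal vectors, 0
print(same, different)
```

With real embeddings, the same function can be applied to any two rows returned by the model's encode() call.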

Model Capabilities

Sentence embedding

Semantic similarity calculation

Text clustering

Semantic search

Use Cases

Information retrieval

Semantic search system

Build a search system based on semantics rather than keywords

Improves the accuracy and relevance of search results
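
One way to picture such a search system is a ranking of stored document embeddings by cosine similarity to the query embedding. The sketch below assumes the embeddings have already been computed and uses small hand-written vectors in their place; the search() helper and its top_k parameter are hypothetical names chosen for this illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def search(query_vec, corpus_vecs, top_k=2):
    """Return (index, score) pairs for the top_k most similar corpus vectors."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        d = normalize(vec)
        # Dot product of unit vectors equals their cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional stand-ins for document and query embeddings.
corpus = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.3, 0.1],
]
query = [1.0, 0.0, 0.0]
hits = search(query, corpus, top_k=2)
print(hits)
```

For large collections, a brute-force loop like this is typically replaced by an approximate nearest-neighbor index, but the ranking criterion stays the same.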

Text analysis

Document clustering

Automatically group semantically similar documents

Organizes documents by topic without requiring labeled training data
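
A simple way to group semantically similar documents is to cluster their embeddings by cosine similarity. The sketch below uses a greedy threshold rule, which is only one of many possible clustering strategies and is not something this model card prescribes; the vectors, the threshold of 0.8, and the function names are assumptions made for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms > 0 else 0.0

def greedy_cluster(vectors, threshold=0.8):
    """Assign each vector to the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of indices; its first index is the seed
    for idx, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(idx)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([idx])
    return clusters

# Toy 2-dimensional stand-ins for document embeddings.
docs = [
    [1.0, 0.1],
    [0.1, 1.0],
    [0.9, 0.2],
    [0.2, 0.9],
]
groups = greedy_cluster(docs, threshold=0.8)
print(groups)
```

The result depends on input order and on the threshold; methods such as k-means or agglomerative clustering are common alternatives when more stable groupings are needed.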

🚀 ko-sbert-sts

This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.

🚀 Quick Start

To get started, install sentence-transformers, load jhgan/ko-sbert-sts, and call encode() on a list of sentences. The sections below walk through each step.

✨ Features

Sentence-Transformers Integration: Works seamlessly with the sentence-transformers library.
Feature Extraction: Capable of extracting features from sentences and paragraphs.
Sentence Similarity: Can be used to calculate the similarity between sentences.

📦 Installation

The simplest way to use this model is through the sentence-transformers library:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

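from sentence_transformers import SentenceTransformer

# Korean input sentences to embed
sentences = ["안녕하세요?", "한국어 문장 임베딩을 위한 버트 모델입니다."]

# Load the model (downloaded from the Hugging Face Hub on first use)
model = SentenceTransformer('jhgan/ko-sbert-sts')
embeddings = model.encode(sentences)  # one 768-dimensional vector per sentence
print(embeddings)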

Advanced Usage

Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, using the attention mask so that padding tokens are ignored
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Korean sentences we want sentence embeddings for
sentences = ["안녕하세요?", "한국어 문장 임베딩을 위한 버트 모델입니다."]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('jhgan/ko-sbert-sts')
model = AutoModel.from_pretrained('jhgan/ko-sbert-sts')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
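
To make the pooling arithmetic concrete, here is a small plain-Python sketch of masked mean pooling for a single sentence. It mirrors the idea of the mean_pooling function above without torch; the token vectors are made-up 2-dimensional values, and masked_mean_pool is an illustrative name.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors whose mask entry is 1; padding (mask 0) is ignored."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                sums[i] += value
    denom = max(count, 1e-9)  # guard against division by zero, like the clamp above
    return [s / denom for s in sums]

tokens = [
    [1.0, 2.0],
    [3.0, 4.0],
    [100.0, 100.0],  # padding position, excluded by the mask
]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding vector would pull the average far away from the real tokens, which is why the attention mask is passed into the pooling step.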

📚 Documentation

Evaluation Results

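The scores below were obtained by training the model on the KorSTS training set and evaluating it on the KorSTS evaluation set. Each row reports the Pearson and Spearman correlation (shown multiplied by 100) between the gold similarity scores and the given similarity measure computed on the sentence embeddings.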

Similarity Measure	Pearson	Spearman
Cosine	81.55	81.23
Euclidean	79.94	79.79
Manhattan	79.90	79.75
Dot	76.02	75.31
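
For readers unfamiliar with the two metrics, the sketch below shows how Pearson and Spearman correlations can be computed between predicted similarities and gold scores. The numbers are invented for illustration and are not KorSTS data; the function names are likewise illustrative. Results here are on the raw scale from -1 to 1, whereas the table above shows values multiplied by 100.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed values."""
    return pearson(ranks(xs), ranks(ys))

# Invented example: model similarities versus human-annotated scores.
predicted = [0.10, 0.40, 0.35, 0.80]
gold = [0.0, 2.0, 3.0, 5.0]
r = pearson(predicted, gold)
rho = spearman(predicted, gold)
print(round(r, 4), round(rho, 4))
```

Pearson measures how linear the relationship is, while Spearman only checks whether the ordering of pairs agrees, so the two can differ when predictions are monotonic but not linear in the gold scores.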

Training

The model was trained with the following parameters:

DataLoader: torch.utils.data.dataloader.DataLoader of length 719 with parameters:

{'batch_size': 8, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss: sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss
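
As a rough illustration of this loss: in its default configuration, CosineSimilarityLoss is generally described as the mean squared error between the cosine similarity of the two sentence embeddings in a pair and that pair's gold score. The plain-Python sketch below reproduces that arithmetic on toy vectors. The helper names are hypothetical, and the assumption that gold scores are scaled to the range 0 to 1 is not stated in this model card.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms > 0 else 0.0

def cosine_similarity_loss(pairs, labels):
    """Mean squared error between cos(u, v) and the target score of each pair."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

# Toy embedding pairs standing in for the two encoded sentences of each example.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction, cosine 1
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, cosine 0
]
labels = [1.0, 0.5]  # assumed gold scores, scaled to [0, 1]
loss = cosine_similarity_loss(pairs, labels)
print(loss)  # 0.125
```

The first pair contributes no error because its cosine matches the label exactly; the second contributes (0 - 0.5)^2 = 0.25, giving a mean of 0.125.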

Parameters of the fit() method:

{
    "epochs": 5,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'transformers.optimization.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 360,
    "weight_decay": 0.01
}
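
To see what these settings imply for the learning rate, the sketch below models a WarmupLinear schedule as a linear ramp from 0 to the base rate over the warmup steps, followed by a linear decay to 0 at the end of training. It assumes that a null steps_per_epoch falls back to the DataLoader length (719), giving 719 × 5 = 3595 total steps; the function name is illustrative and the code is an approximation of the scheduler's behavior, not the library's implementation.

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=360, total_steps=719 * 5):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total = 719 * 5  # 3595 steps, assuming steps_per_epoch defaults to the DataLoader length
lr_start = warmup_linear_lr(0)        # 0.0 at the very first step
lr_half = warmup_linear_lr(180)       # halfway through warmup, half the base rate
lr_peak = warmup_linear_lr(360)       # base rate reached when warmup ends
lr_end = warmup_linear_lr(total)      # decays back to 0.0 at the final step
print(lr_start, lr_half, lr_peak, lr_end)
```

Under these assumptions, the 360 warmup steps cover roughly the first 10% of training.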

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

Citing & Authors

Ham, J., Choe, Y. J., Park, K., Choi, I., & Soh, H. (2020). KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding. arXiv preprint arXiv:2004.03289.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084.
Reimers, N., & Gurevych, I. (2020). Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation. EMNLP.
