Open-source Korean Sentence Embedding Model ko-sroberta-multitask - Empowering Clustering and Semantic Search Tasks

Ko Sroberta Multitask

Developed by jhgan

This is a Korean sentence embedding model based on sentence-transformers, capable of mapping sentences and paragraphs into a 768-dimensional dense vector space, suitable for tasks such as clustering or semantic search.

Text Embedding  Korean  #Korean sentence embedding  #Multi-task learning  #Semantic similarity

Downloads 162.23k

Release Time: 3/2/2022

Model Overview

This model uses the RoBERTa architecture and was trained with multi-task learning to produce Korean sentence embeddings. It supports sentence similarity calculation and feature extraction.

Model Features

Multi-task learning

The model is trained on the KorSTS and KorNLI datasets with multi-task learning, which improves the quality of the sentence embeddings.

Efficient semantic representation

Capable of efficiently mapping sentences and paragraphs into a 768-dimensional dense vector space while preserving semantic information.

Korean optimization

Specially optimized for Korean, suitable for Korean sentence embedding and similarity calculation.

Model Capabilities

Sentence embedding

Semantic search

Text clustering

Sentence similarity calculation

Use Cases

Natural Language Processing

Semantic search

Use sentence embeddings for efficient semantic search to find documents or paragraphs semantically similar to the query sentence.
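The ranking step of such a search can be sketched without any dependencies. This is a minimal illustration that assumes the embeddings have already been computed (for example with `model.encode`); the toy 3-dimensional vectors and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical and not part of the model or of sentence-transformers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, corpus_vecs, top_k=3):
    """Return (corpus_index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy stand-ins for real 768-dimensional embeddings (hypothetical values).
corpus_vecs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 1.0, 0.0],
]
query_vec = [1.0, 0.2, 0.0]

hits = rank_by_similarity(query_vec, corpus_vecs, top_k=2)
for index, score in hits:
    print(index, round(score, 3))
```

For large corpora a vectorized or approximate nearest-neighbor search would normally replace the Python loop; the scoring idea stays the same.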

Text clustering

Cluster large amounts of Korean text into groups with similar semantics for text classification or information organization.
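One simple way to group embeddings is a greedy threshold pass, sketched below. This is an illustrative technique chosen for brevity, not a method prescribed by the model card; the function name `greedy_cluster`, the 0.8 threshold, and the 2-dimensional toy vectors are all assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def greedy_cluster(vectors, threshold=0.8):
    """Put each vector in the first cluster whose seed vector is at least
    `threshold` similar to it; otherwise start a new cluster."""
    clusters = []  # each cluster is a list of indices; members[0] is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine_similarity(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no existing cluster was similar enough
    return clusters

# Toy stand-ins for sentence embeddings (hypothetical values).
vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.1],
    [0.1, 0.9],
]
clusters = greedy_cluster(vectors, threshold=0.8)
print(clusters)  # [[0, 2], [1, 3]]
```

The result depends on input order and on the threshold, so for real workloads an established algorithm such as k-means or agglomerative clustering is usually a better fit.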

🚀 ko-sroberta-multitask

This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It can be used for tasks such as clustering or semantic search.

🚀 Quick Start

✨ Features

Maps sentences and paragraphs to a 768-dimensional dense vector space.
Suitable for clustering and semantic search tasks.

📦 Installation

With sentence-transformers installed, using this model is straightforward:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

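from sentence_transformers import SentenceTransformer

# Korean example sentences: "Hello?" and "This is a BERT model for Korean sentence embedding."
sentences = ["안녕하세요?", "한국어 문장 임베딩을 위한 버트 모델입니다."]

# Load the model from the Hugging Face Hub and encode the sentences
model = SentenceTransformer('jhgan/ko-sroberta-multitask')
embeddings = model.encode(sentences)
print(embeddings)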

Advanced Usage

Without sentence-transformers, you can use the model in two steps: first pass the input through the transformer model, then apply the appropriate pooling operation (mean pooling for this model) on top of the contextualized token embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('jhgan/ko-sroberta-multitask')
model = AutoModel.from_pretrained('jhgan/ko-sroberta-multitask')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
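To see why the attention mask matters in `mean_pooling`, the same arithmetic can be worked through on a tiny example in plain Python. This is only a sketch: the helper name `masked_mean_pool` and the toy values (including the deliberately large padding vector) are made up for illustration.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors where mask == 1, ignoring padding (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for d in range(dim):
                summed[d] += vec[d]
    # Same guard against division by zero as the clamp in the torch version.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]

# One "sentence" with two real tokens and one padding position (toy values).
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean_pool(token_embeddings, attention_mask)
naive = [sum(col) / len(token_embeddings) for col in zip(*token_embeddings)]

print(pooled)  # [2.0, 3.0]
print(naive)   # distorted by the padding vector
```

Averaging over every position, as `naive` does, lets padding leak into the sentence embedding, which is why the mask is applied before dividing.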

📚 Documentation

Evaluation Results

The scores below were obtained on the KorSTS evaluation dataset after multi-task training on the KorSTS and KorNLI training datasets.

Metric	Score
Cosine Pearson	84.77
Cosine Spearman	85.60
Euclidean Pearson	83.71
Euclidean Spearman	84.40
Manhattan Pearson	83.70
Manhattan Spearman	84.38
Dot Pearson	82.42
Dot Spearman	82.33
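Each row pairs a similarity function (cosine, Euclidean, Manhattan, dot product) with a correlation measure between the model's similarity scores and the human gold scores. The values appear to be correlation coefficients multiplied by 100, which is an assumption here since the table does not state the scale. The sketch below shows how Pearson and Spearman correlation are computed; the helper names and the four toy data points are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Toy data: human gold scores and model similarity scores for four pairs.
gold = [0.0, 1.0, 2.0, 3.0]
predicted = [0.1, 0.2, 0.4, 0.9]

pearson_score = pearson(gold, predicted)
spearman_score = spearman(gold, predicted)
print(round(pearson_score, 4), round(spearman_score, 4))
```

Spearman only looks at ordering, so a model that ranks every pair correctly reaches 1.0 even when its scores are not linearly related to the gold scores.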

Training

The model was trained with two objectives, each with its own DataLoader and loss, using the following parameters:

DataLoader: sentence_transformers.datasets.NoDuplicatesDataLoader.NoDuplicatesDataLoader of length 8885 with parameters:

{'batch_size': 64}

Loss: sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:

{'scale': 20.0, 'similarity_fct': 'cos_sim'}
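As a rough sketch of what this loss computes for a batch of (anchor, positive) pairs: each anchor is scored against every positive in the batch with cosine similarity, the scores are multiplied by `scale`, and cross-entropy is taken with the matching positive as the correct class. The plain-Python function below is an illustration under those assumptions, not the library's implementation, and it covers only the pair case without explicit hard negatives.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for
    anchors[i] and every other positive in the batch acts as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two anchors (hypothetical 2-dimensional embeddings).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each anchor matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each anchor matches the other positive

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_swapped = multiple_negatives_ranking_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is near zero when each anchor is closest to its own positive and grows when another item in the batch scores higher, which also explains the `NoDuplicatesDataLoader`: duplicate sentences in a batch would be treated as false negatives.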

DataLoader: torch.utils.data.dataloader.DataLoader of length 719 with parameters:

{'batch_size': 8, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss: sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss
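No parameters are listed for this loss, so it presumably runs with its defaults, which means mean squared error between the cosine similarity of the two sentence embeddings and the gold score. The sketch below illustrates that computation in plain Python. It assumes gold scores already rescaled to the range 0 to 1 (the rescaling of KorSTS labels is an assumption, not stated in the source), and the function and variable names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cosine_similarity_loss(pairs, gold_scores):
    """Mean squared error between cos(u, v) and the gold similarity score."""
    errors = [
        (cosine_similarity(u, v) - gold) ** 2
        for (u, v), gold in zip(pairs, gold_scores)
    ]
    return sum(errors) / len(errors)

# Toy sentence-pair embeddings (hypothetical values).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction, cosine = 1
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, cosine = 0
]

perfect = cosine_similarity_loss(pairs, [1.0, 0.0])
off_by_half = cosine_similarity_loss(pairs, [0.5, 0.5])
print(perfect, off_by_half)  # 0.0 0.25
```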

Parameters of the fit()-Method:

{
    "epochs": 5,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'transformers.optimization.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 360,
    "weight_decay": 0.01
}
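The `WarmupLinear` scheduler raises the learning rate linearly from zero to `lr` over `warmup_steps`, then lowers it linearly back to zero by the end of training. The sketch below reproduces that shape in plain Python. The total of 3595 steps is an assumption: it takes 5 epochs times 719 steps, on the guess that a null `steps_per_epoch` falls back to the shorter of the two DataLoaders. The function name `warmup_linear_lr` is hypothetical.

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=360, total_steps=3595):
    """Learning rate at `step` for linear warmup followed by linear decay.

    total_steps=3595 is an assumed value (5 epochs x 719 steps).
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Sample the schedule at a few points.
for step in (0, 180, 360, 2000, 3595):
    print(step, warmup_linear_lr(step))
```

Under that assumption, the 360 warmup steps cover roughly the first 10% of training.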

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

📄 License

No license information provided in the original document.

Citing & Authors

Ham, J., Choe, Y. J., Park, K., Choi, I., & Soh, H. (2020). KorNLI and KorSTS: New benchmark datasets for Korean natural language understanding. arXiv preprint arXiv:2004.03289.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
Reimers, N., & Gurevych, I. (2020). Making monolingual sentence embeddings multilingual using knowledge distillation. EMNLP 2020.
