KR-SBERT-V40K-klueNLI-augSTS Open-source Korean Model - Supports Sentence and Paragraph Clustering and Semantic Search


Developed by snunlp

This is a Korean sentence embedding model based on sentence-transformers, capable of mapping sentences and paragraphs into a 768-dimensional dense vector space, suitable for tasks such as clustering or semantic search.

Text Embedding

Transformers

Korean #Korean sentence similarity #768-dimensional vector embedding #NLI-STS optimized

Downloads 500.73k

Release Time: 5/3/2022

Model Overview

This model is a sentence transformer optimized for Korean. It is pre-trained and then fine-tuned to produce sentence embeddings for natural language processing tasks such as sentence similarity calculation and semantic search.

Model Features

Korean optimized

Optimized for Korean text to better capture the semantic features of Korean sentences

High-quality embeddings

Generates 768-dimensional dense vector representations, effectively capturing sentence semantic information

Multi-task training

Trained on klueNLI and augSTS datasets, enhancing the model's generalization capability

Model Capabilities

Sentence embedding representation

Semantic similarity calculation

Text clustering

Semantic search
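As a rough illustration of how similarity calculation and semantic search typically work on top of sentence embeddings, the sketch below ranks a small corpus against a query by cosine similarity. It is a minimal sketch, not the model's own API: the function names cosine_similarity and semantic_search are hypothetical, and the 3-dimensional toy vectors stand in for the 768-dimensional vectors that model.encode would return.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, corpus_vecs, top_k=3):
    """Return (corpus index, score) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy stand-ins for real embeddings (assumption: real ones come from model.encode).
corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query = [1.0, 0.0, 0.0]
hits = semantic_search(query, corpus, top_k=2)
print(hits)  # indices 0 and 2 rank highest
```

In practice the corpus embeddings would usually be computed once and cached, so that only the query needs to be encoded at search time.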

Use Cases

Text similarity

Restaurant review analysis

Analyze user reviews of restaurants to find semantically similar comments

Accurately identifies similar comments regarding restaurant hygiene issues

Document classification

News classification

Use sentence embeddings to classify Korean news articles

Achieves a classification accuracy of 86.28%

🚀 snunlp/KR-SBERT-V40K-klueNLI-augSTS

This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It can be used for tasks such as clustering or semantic search.

🚀 Quick Start

This section will guide you through the basic usage of the snunlp/KR-SBERT-V40K-klueNLI-augSTS model.

✨ Features

Maps sentences and paragraphs to a 768-dimensional dense vector space.
Suitable for tasks like clustering and semantic search.
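To make the clustering use concrete, here is a minimal sketch of one simple approach: greedy threshold clustering over cosine similarity. This is an illustrative technique chosen for brevity, not something the model card prescribes. The function name threshold_cluster, the threshold value, and the 2-dimensional toy vectors are all assumptions; real inputs would be the 768-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def threshold_cluster(vectors, threshold=0.8):
    """Greedy clustering: put each vector into the first cluster whose seed
    (its first member) is at least `threshold` similar, else start a new cluster."""
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no existing cluster was close enough
    return clusters

# Toy stand-ins for sentence embeddings: two groups pointing in different directions.
toy_vectors = [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
clusters = threshold_cluster(toy_vectors, threshold=0.9)
print(clusters)
```

The result depends on input order and on the threshold; for larger datasets an algorithm such as k-means or agglomerative clustering is a more common choice.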

📦 Installation

To use this model, you need to install the sentence-transformers library:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

If you have sentence-transformers installed, you can use the model as follows:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('snunlp/KR-SBERT-V40K-klueNLI-augSTS')
embeddings = model.encode(sentences)
print(embeddings)

Advanced Usage

Without sentence-transformers, you can use the model by passing your input through the transformer model and then applying the right pooling operation on top of the contextualized word embeddings:

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('snunlp/KR-SBERT-V40K-klueNLI-augSTS')
model = AutoModel.from_pretrained('snunlp/KR-SBERT-V40K-klueNLI-augSTS')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
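To show why the attention mask matters in the pooling step above, the sketch below redoes the same arithmetic for a single sentence using plain Python lists instead of tensors, so it runs without PyTorch. The helper name masked_mean and the toy values are assumptions for illustration only.

```python
def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask value is 1 (real tokens)."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for d in range(dim):
                summed[d] += vec[d]
    # Guard against division by zero, mirroring torch.clamp(..., min=1e-9).
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]

# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
naive = [sum(col) / len(tokens) for col in zip(*tokens)]  # ignores the mask

print(pooled)  # [2.0, 3.0]
print(naive)   # skewed by the padding token
```

Without the mask, padding positions would be averaged in, and the embedding of a sentence would change depending on how long the other sentences in the batch are.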

📚 Documentation

Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

Application for document classification

Tutorial in Google Colab: https://colab.research.google.com/drive/1S6WSjOx9h6Wh_rX1Z2UXwx9i_uHLlOiM

Model	Accuracy
KR-SBERT-Medium-NLI-STS	0.8400
KR-SBERT-V40K-NLI-STS	0.8400
KR-SBERT-V40K-NLI-augSTS	0.8511
KR-SBERT-V40K-klueNLI-augSTS	0.8628
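
The classification method used in the tutorial is not described here. As one common way to classify documents with sentence embeddings, the sketch below uses a nearest-centroid classifier: each class is represented by the mean of its training embeddings, and a new document is assigned to the class whose centroid is most similar. The function names, labels, and 2-dimensional toy vectors are all hypothetical, and this is not claimed to reproduce the accuracies in the table.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def fit_centroids(embeddings, labels):
    """Group embeddings by label and compute one centroid per label."""
    grouped = {}
    for vec, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in grouped.items()}

def predict(vec, centroids):
    """Return the label whose centroid is most cosine-similar to `vec`."""
    return max(centroids, key=lambda label: cosine_similarity(vec, centroids[label]))

# Toy stand-ins for embeddings of labeled news articles (hypothetical labels).
train_vecs = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.2, 0.8]]
train_labels = ["sports", "sports", "politics", "politics"]

centroids = fit_centroids(train_vecs, train_labels)
predicted = predict([0.8, 0.2], centroids)
print(predicted)  # sports
```

A trained classifier such as logistic regression on top of the embeddings is another common option and may perform better when enough labeled data is available.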

📄 License

Citation

@misc{kr-sbert,
  author = {Park, Suzi and Shin, Hyopil},
  title = {KR-SBERT: A Pre-trained Korean-specific Sentence-BERT model},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snunlp/KR-SBERT}}
}
