KR-SBERT-V40K-klueNLI-augSTS-ft Open-source Korean Model - Efficient Sentence Similarity Calculation

KR-SBERT-V40K-klueNLI-augSTS-ft

Developed by marigold334

This is a Korean sentence embedding model fine-tuned from the KR-SBERT model of the SNUNLP Lab, with a focus on sentence similarity tasks.

Text Embedding

Transformers

Korean #Korean sentence similarity #NLI fine-tuning optimization #Multi-round negative sampling training

Downloads 119

Release Time : 11/28/2023

Model Overview

This model is a fine-tuned version of the KR-SBERT model. It generates sentence embeddings and is intended mainly for Korean sentence similarity tasks.

Model Features

Korean language optimization

The model is trained specifically for Korean and is intended to capture the semantic features of Korean sentences.

Fine-tuning enhancement

Fine-tuned from the original KR-SBERT model to improve performance on sentence similarity.

Efficient embeddings

Generates a 768-dimensional embedding for each sentence using mean pooling over the token embeddings.

Model Capabilities

Sentence embedding generation

Sentence similarity calculation

Semantic feature extraction

Use Cases

Semantic search

Restaurant review analysis

Analyze user reviews of restaurants to find semantically similar reviews.

Reviews that describe similar dining experiences can then be grouped together.

Text matching

Q&A system

Match user questions with candidate answers in the knowledge base.

The aim is to improve the accuracy and response speed of the Q&A system.
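
The matching step can be sketched as a nearest-neighbour lookup over precomputed answer embeddings. This is only an illustration of one way such a system might be wired, not something the model card specifies: `normalize`, `build_index`, and `top_k` are hypothetical helper names, and the toy 3-dimensional vectors stand in for the 768-dimensional embeddings the model would produce.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / norm for x in vec]


def build_index(answers, embeddings):
    """Pair each candidate answer with its unit-length embedding."""
    return [(answer, normalize(vec)) for answer, vec in zip(answers, embeddings)]


def top_k(query_vec, index, k=1):
    """Return the k (score, answer) pairs most similar to the query."""
    query = normalize(query_vec)
    scored = (
        (sum(q * a for q, a in zip(query, vec)), answer)
        for answer, vec in index
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy vectors (assumption); in practice these would come from model.encode(...).
answers = ["opening hours", "refund policy", "delivery area"]
answer_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.6, 0.8]]
index = build_index(answers, answer_vecs)

best = top_k([0.1, 0.9, 0.2], index, k=2)
print(best)
```

Normalizing the answer embeddings once at indexing time means each query needs only dot products, which is a common choice when the knowledge base is much larger than the number of queries.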

🚀 marigold334/KR-SBERT-V40K-klueNLI-augSTS-ft

This is a fine-tuned version of the KR-SBERT model from the SNUNLP lab. It is designed for sentence similarity tasks.

Pipeline and Tags

Pipeline Tag: sentence-similarity
Tags: sentence-transformers, feature-extraction, sentence-similarity, transformers

Language

Korean (ko)

Widget Examples

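Example 1: Restaurant
- Source Sentence: "그 식당은 파리를 날린다" (idiom, literally "that restaurant is shooing flies": business is dead)
- Comparison Sentences:
  - "그 식당에는 손님이 없다" ("That restaurant has no customers")
  - "그 식당에서는 드론을 날린다" ("They fly drones at that restaurant")
  - "파리가 식당에 날아다닌다" ("Flies are flying around the restaurant")

Example 2: Sleepy
- Source Sentence: "잠이 옵니다" ("I am getting sleepy")
- Comparison Sentences:
  - "잠이 안 옵니다" ("I cannot get to sleep")
  - "졸음이 옵니다" ("I am getting drowsy")
  - "기차가 옵니다" ("The train is coming")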
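
The comparison the widget performs can be sketched roughly as follows. This is a minimal illustration, not the widget's actual implementation: `rank_candidates`, `load_encoder`, and `stub_encode` are hypothetical names, the repository ID is taken from this card's heading, and the stub encoder exists only so the ranking logic can run without downloading the model. Its scores say nothing about what the real model would return.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(encode, source, candidates):
    """Encode all sentences, then sort candidates by similarity to the source.

    `encode` is any callable mapping a list of sentences to a list of vectors,
    for example the `encode` method of a loaded SentenceTransformer.
    """
    vectors = encode([source] + list(candidates))
    source_vec, candidate_vecs = vectors[0], vectors[1:]
    scored = [(cosine(source_vec, vec), text)
              for vec, text in zip(candidate_vecs, candidates)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def load_encoder(model_name="marigold334/KR-SBERT-V40K-klueNLI-augSTS-ft"):
    """Load the real model lazily (needs sentence-transformers and network access)."""
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name).encode


def stub_encode(sentences):
    """Hypothetical stand-in encoder: length-based vectors, NOT real embeddings."""
    return [[float(len(s)), float(s.count(" ")), 1.0] for s in sentences]


ranking = rank_candidates(
    stub_encode,  # swap in load_encoder() to score with the actual model
    "잠이 옵니다",
    ["잠이 안 옵니다", "졸음이 옵니다", "기차가 옵니다"],
)
for score, text in ranking:
    print(f"{score:.3f}  {text}")
```

Passing the encoder in as a callable keeps the ranking logic independent of how the embeddings are produced, so the same function works with the real model or with cached vectors.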

🚀 Quick Start

✨ Features

This model is a fine-tuned version of KR-SBERT intended for sentence similarity tasks.

📦 Installation

To use this model, you need to install sentence-transformers:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage (Sentence-Transformers)

If you have sentence-transformers installed, using this model is straightforward:

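from sentence_transformers import SentenceTransformer

# Example inputs; Korean sentences are passed the same way
sentences = ["This is an example sentence", "Each sentence is converted"]

# Repository ID as given in this card's heading
model = SentenceTransformer('marigold334/KR-SBERT-V40K-klueNLI-augSTS-ft')
embeddings = model.encode(sentences)  # one 768-dimensional vector per sentence
print(embeddings)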

Advanced Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model as follows: First, pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: average the token embeddings, using the attention mask to ignore padding
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
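# Tokenizer and weights are loaded from the same repository (ID as given in this card's heading)
tokenizer = AutoTokenizer.from_pretrained('marigold334/KR-SBERT-V40K-klueNLI-augSTS-ft')
model = AutoModel.from_pretrained('marigold334/KR-SBERT-V40K-klueNLI-augSTS-ft')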

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
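
For reference, the masked average that `mean_pooling` computes can be written as a single formula. The symbols are introduced here for illustration only: $\mathbf{h}_i$ is the embedding of token $i$, $m_i \in \{0, 1\}$ is its attention-mask value (0 for padding), and $T$ is the padded sequence length.

```latex
\mathbf{s} = \frac{\sum_{i=1}^{T} m_i \, \mathbf{h}_i}{\max\!\left(\sum_{i=1}^{T} m_i,\; 10^{-9}\right)}
```

The $\max(\cdot, 10^{-9})$ in the denominator corresponds to the `torch.clamp(..., min=1e-9)` call and only guards against division by zero for an all-padding row. For example, with three token embeddings and mask $(1, 1, 0)$, the result is the plain average of the first two.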

🔧 Technical Details

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
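
The printout above corresponds to two sentence-transformers modules chained together. As a rough sketch, the same stack could be assembled by hand as shown below. `build_model` is a hypothetical helper name, and the imports are placed inside the function so that defining it does not require the library to be installed. Loading the model by name, as in the basic usage example, should normally produce this configuration already.

```python
# Settings copied from the architecture printout above.
MAX_SEQ_LENGTH = 128
DO_LOWER_CASE = False


def build_model(model_name):
    """Assemble Transformer + mean Pooling modules into a SentenceTransformer."""
    # Lazy import: requires the sentence-transformers package at call time.
    from sentence_transformers import SentenceTransformer, models

    word_embedding_model = models.Transformer(
        model_name,
        max_seq_length=MAX_SEQ_LENGTH,  # longer inputs are truncated
        do_lower_case=DO_LOWER_CASE,
    )
    pooling_model = models.Pooling(
        word_embedding_model.get_word_embedding_dimension(),  # 768 per the printout
        pooling_mode_cls_token=False,
        pooling_mode_mean_tokens=True,  # mean pooling, as in the printout
        pooling_mode_max_tokens=False,
        pooling_mode_mean_sqrt_len_tokens=False,
    )
    return SentenceTransformer(modules=[word_embedding_model, pooling_model])
```

Calling `build_model(...)` with a repository ID downloads the weights, so it needs network access. Note that `max_seq_length` is 128, so sentences longer than 128 tokens are cut off before pooling.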
