TAACO_STS Open-source Korean Sentence Similarity Model - Free Measurement of Sentence Semantic Coherence

TAACO STS

Developed by KDHyun08

Korean sentence similarity model trained on the Sentence-transformers framework, used to measure semantic coherence between sentences

Text Embedding

Transformers

Korean #Korean sentence similarity #Semantic coherence measurement #KLUE-STS training

Downloads 24

Release Time : 7/25/2022

Model Overview

This model is a component of K-TAACO (tentative name), the author's tool for measuring cohesion between Korean sentences. It computes semantic similarity between Korean sentences and was trained on the KLUE STS dataset.

Model Features

Korean language optimization

Trained specifically for Korean sentence similarity and intended for Korean semantic analysis tasks

Semantic coherence measurement

As a component of the K-TAACO tool, it measures semantic coherence between sentences

Pre-training + fine-tuning

Fine-tuned from a pre-trained BERT model on Korean STS data

Model Capabilities

Sentence embedding generation

Semantic similarity calculation

Korean text processing

Use Cases

Text analysis

Sentence coherence evaluation

Evaluate the semantic coherence between sentences in a text

Quantifies sentence similarity, helping assess text fluency

Semantic search

Text retrieval based on semantics rather than keywords

Finds related sentences that are semantically similar but worded differently

Educational technology

Essay scoring assistance

Analyze the logical coherence between sentences in student essays

Gives teachers a quantitative coherence score to use as a reference when grading

🚀 TAACO_Similarity

This model is built with Sentence-transformers and trained on the KLUE STS (Semantic Textual Similarity) dataset. It was developed to measure semantic cohesion between Korean sentences, which is one of the indicators in K-TAACO (tentative name), the author's tool for measuring cohesion between Korean sentences. The author plans further training on additional data, such as the sentence similarity data from the Modu Corpus.

🚀 Quick Start

✨ Features

Based on Sentence-transformers.
Trained on the KLUE STS dataset.
Used for measuring semantic cohesion between Korean sentences.

📦 Installation

To use this model, you need to install Sentence-transformers:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer, models
sentences = ["This is an example sentence", "Each sentence is converted"]

embedding_model = models.Transformer(
    model_name_or_path="KDHyun08/TAACO_STS", 
    max_seq_length=256,
    do_lower_case=True
)

pooling_model = models.Pooling(
    embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)
model = SentenceTransformer(modules=[embedding_model, pooling_model])

embeddings = model.encode(sentences)
print(embeddings)

Advanced Usage

After installing Sentence-transformers, you can rank sentences by their similarity to a query as follows. The query variable holds the source sentence, and docs holds the list of sentences to compare against it.

import torch
from sentence_transformers import util

# `model` is the SentenceTransformer instance built in Basic Usage above.

docs = ['어제는 아내의 생일이었다', '생일을 맞이하여 아침을 준비하겠다고 오전 8시 30분부터 음식을 준비하였다. 주된 메뉴는 스테이크와 낙지볶음, 미역국, 잡채, 소야 등이었다', '스테이크는 자주 하는 음식이어서 자신이 준비하려고 했다', '앞뒤도 1분씩 3번 뒤집고 래스팅을 잘 하면 육즙이 가득한 스테이크가 준비되다', '아내도 그런 스테이크를 좋아한다. 그런데 상상도 못한 일이 벌이지고 말았다', '보통 시즈닝이 되지 않은 원육을 사서 스테이크를 했는데, 이번에는 시즈닝이 된 부챗살을 구입해서 했다', '그런데 케이스 안에 방부제가 들어있는 것을 인지하지 못하고 방부제와 동시에 프라이팬에 올려놓을 것이다', '그것도 인지 못한 체... 앞면을 센 불에 1분을 굽고 뒤집는 순간 방부제가 함께 구어진 것을 알았다', '아내의 생일이라 맛있게 구워보고 싶었는데 어처구니없는 상황이 발생한 것이다', '방부제가 센 불에 녹아서 그런지 물처럼 흘러내렸다', ' 고민을 했다. 방부제가 묻은 부문만 제거하고 다시 구울까 했는데 방부제에 절대 먹지 말라는 문구가 있어서 아깝지만 버리는 방향을 했다', '너무나 안타까웠다', '아침 일찍 아내가 좋아하는 스테이크를 준비하고 그것을 맛있게 먹는 아내의 모습을 보고 싶었는데 전혀 생각지도 못한 상황이 발생해서... 하지만 정신을 추스르고 바로 다른 메뉴로 변경했다', '소야, 소시지 야채볶음..', '아내가 좋아하는지 모르겠지만 냉장고 안에 있는 후랑크소세지를 보니 바로 소야를 해야겠다는 생각이 들었다. 음식은 성공적으로 완성이 되었다', '40번째를 맞이하는 아내의 생일은 성공적으로 준비가 되었다', '맛있게 먹어 준 아내에게도 감사했다', '매년 아내의 생일에 맞이하면 아침마다 생일을 차려야겠다. 오늘도 즐거운 하루가 되었으면 좋겠다', '생일이니까~']
# Encode each sentence into an embedding vector
document_embeddings = model.encode(docs)

query = '생일을 맞이하여 아침을 준비하겠다고 오전 8시 30분부터 음식을 준비하였다'
query_embedding = model.encode(query)

top_k = min(10, len(docs))

# Calculate cosine similarity
cos_scores = util.pytorch_cos_sim(query_embedding, document_embeddings)[0]

# Take the top_k highest-scoring sentences, in descending order of cosine similarity
top_results = torch.topk(cos_scores, k=top_k)

print(f"Input sentence: {query}")
print(f"\n<Top {top_k} sentences similar to the input sentence>\n")

for i, (score, idx) in enumerate(zip(top_results[0], top_results[1])):
    print(f"{i+1}: {docs[idx]} (Similarity: {score:.4f})\n")
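The ranking above compares every sentence to one query. For cohesion, a related view is to compare each sentence with the one that follows it. The sketch below is only an illustration of that idea, not K-TAACO's actual indicator: the function name `adjacent_cohesion` is hypothetical, and the toy 2-D vectors stand in for the output of `model.encode(docs)`.

```python
import numpy as np

def adjacent_cohesion(embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity between each sentence embedding and the next one."""
    emb = np.asarray(embeddings, dtype=np.float64)
    # L2-normalize each row so that a dot product equals cosine similarity.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # Pair row i with row i+1 and take the row-wise dot product.
    return np.sum(unit[:-1] * unit[1:], axis=1)

# Toy vectors (assumption); real embeddings from this model have 768 dimensions.
toy_embeddings = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, -1.0]])
scores = adjacent_cohesion(toy_embeddings)
print(scores)         # one score per adjacent sentence pair
print(scores.mean())  # a simple document-level summary
```

With real data, passing `model.encode(docs)` in place of `toy_embeddings` would give one score per adjacent sentence pair in the text.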

📚 Documentation

Train Data

KLUE-sts-v1.1._train.json
NLI-sts-train.tsv

Evaluation Results

Running the Advanced Usage example above prints the following. Each score is a cosine similarity: the closer it is to 1, the more similar the sentence is to the query.

Input sentence: 생일을 맞이하여 아침을 준비하겠다고 오전 8시 30분부터 음식을 준비하였다

<Top 10 sentences similar to the input sentence>

1: 생일을 맞이하여 아침을 준비하겠다고 오전 8시 30분부터 음식을 준비하였다. 주된 메뉴는 스테이크와 낙지볶음, 미역국, 잡채, 소야 등이었다 (Similarity: 0.6687)

2: 매년 아내의 생일에 맞이하면 아침마다 생일을 차려야겠다. 오늘도 즐거운 하루가 되었으면 좋겠다 (Similarity: 0.6468)

3: 40번째를 맞이하는 아내의 생일은 성공적으로 준비가 되었다 (Similarity: 0.4647)

4: 아내의 생일이라 맛있게 구워보고 싶었는데 어처구니없는 상황이 발생한 것이다 (Similarity: 0.4469)

5: 생일이니까~ (Similarity: 0.4218)

6: 어제는 아내의 생일이었다 (Similarity: 0.4192)

7: 아침 일찍 아내가 좋아하는 스테이크를 준비하고 그것을 맛있게 먹는 아내의 모습을 보고 싶었는데 전혀 생각지도 못한 상황이 발생해서... 하지만 정신을 추스르고 바로 다른 메뉴로 변경했다 (Similarity: 0.4156)

8: 맛있게 먹어 준 아내에게도 감사했다 (Similarity: 0.3093)

9: 아내가 좋아하는지 모르겠지만 냉장고 안에 있는 후랑크소세지를 보니 바로 소야를 해야겠다는 생각이 들었다. 음식은 성공적으로 완성이 되었다 (Similarity: 0.2259)

10: 아내도 그런 스테이크를 좋아한다. 그런데 상상도 못한 일이 벌이지고 말았다 (Similarity: 0.1967)
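The similarity values listed above are cosine similarities between embedding vectors. As a minimal sketch of what that number means, the following computes it by hand with NumPy on toy vectors. The helper `cosine_similarity` is written here for illustration and is not part of the model's code.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Dot product of two vectors divided by the product of their lengths."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same direction, different length: similarity is at its maximum.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: no similarity.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite direction: similarity is at its minimum.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Because only the direction of the vectors matters, the score does not depend on how long either embedding vector is.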

🔧 Technical Details

DataLoader

torch.utils.data.dataloader.DataLoader of length 142 with parameters:

{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss

sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss
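As a rough sketch of what this loss computes: with its default settings, CosineSimilarityLoss takes the cosine similarity of the two sentence embeddings in each pair and penalizes the squared difference from the gold score. The NumPy version below is an illustration only. The function name is hypothetical, and the gold scores being scaled to the 0 to 1 range is an assumption, since the model card does not say how the labels were prepared.

```python
import numpy as np

def cosine_similarity_loss(emb_a, emb_b, gold_scores) -> float:
    """Mean squared error between pairwise cosine similarity and gold scores."""
    a = np.asarray(emb_a, dtype=np.float64)
    b = np.asarray(emb_b, dtype=np.float64)
    gold = np.asarray(gold_scores, dtype=np.float64)
    # Row-wise cosine similarity: one value per sentence pair.
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(np.mean((cos - gold) ** 2))

# Two toy sentence pairs (assumption): identical vectors and perpendicular vectors.
emb_a = np.array([[1.0, 0.0], [1.0, 0.0]])
emb_b = np.array([[1.0, 0.0], [0.0, 1.0]])
gold = np.array([1.0, 0.5])  # assumed to be scaled to [0, 1]

loss = cosine_similarity_loss(emb_a, emb_b, gold)
print(loss)
```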

Parameters of the fit()-Method:

{
    "epochs": 4,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'transformers.optimization.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 10000,
    "weight_decay": 0.01
}
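To make the scheduler settings concrete, here is a plain-Python sketch of a linear warmup followed by linear decay, which is the shape the WarmupLinear scheduler follows. The function name `warmup_linear_lr` is hypothetical. The total step count of 142 × 4 assumes that the DataLoader length of 142 is the number of optimizer steps per epoch, which the source does not state explicitly.

```python
def warmup_linear_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given step: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup phase: scale from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale from base_lr down to 0 at total_steps.
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)

BASE_LR = 2e-05        # "lr" from optimizer_params
WARMUP_STEPS = 10000   # "warmup_steps"
TOTAL_STEPS = 142 * 4  # assumption: DataLoader length x epochs

final_lr = warmup_linear_lr(TOTAL_STEPS, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
print(final_lr)
```

Under that assumption, the run would end while still in the warmup phase, so the learning rate would stay well below the configured 2e-05. Treat this as a reading of the listed parameters rather than a confirmed description of the training run.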

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
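The Pooling module above has only `pooling_mode_mean_tokens` enabled, so a sentence embedding is the average of its token embeddings. The NumPy sketch below illustrates that idea, with padding tokens excluded through the attention mask. It is a simplified stand-in rather than the library's implementation, and the toy tensors use 2 dimensions instead of the model's 768.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings per sentence, ignoring padded positions."""
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden size.
    mask = attention_mask[..., None].astype(np.float64)
    summed = (token_embeddings * mask).sum(axis=1)
    # Number of real tokens per sentence; clipped to avoid division by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# One toy sentence with three token slots; the last slot is padding (assumption).
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = mean_pool(tokens, mask)
print(pooled)
```

Masking matters because padded positions would otherwise pull the average toward values that carry no meaning for the sentence.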

📄 License

No license information provided in the original document.

Citing & Authors
