smartmind/roberta-ko-small-tsdae
This is a sentence-transformers model that maps sentences and paragraphs to a 256-dimensional dense vector space, suitable for tasks such as clustering and semantic search. It is a small Korean RoBERTa model pretrained with TSDAE. The model can be used as-is to compute sentence similarity, or fine-tuned for specific needs.
🚀 Quick Start
✨ Features
- Maps sentences and paragraphs to a 256-dimensional dense vector space.
- Can be used for clustering or semantic search.
- Can be directly used for sentence similarity calculation or fine-tuned.
📦 Installation
To use this model, install sentence-transformers:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
Usage with Sentence-Transformers
After installing sentence-transformers, you can directly load the model as follows:
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')
embeddings = model.encode(sentences)
print(embeddings)
```
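`encode` returns one 256-dimensional vector per sentence (a NumPy array by default). As a quick sanity check, here is a minimal sketch that compares two Korean paraphrases with `util.cos_sim`; the two sentences are taken from the example below:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')

# Encode two paraphrases and print their cosine similarity (a 1x1 tensor).
embeddings = model.encode(["대한민국의 수도는 서울입니다.", "서울은 대한민국의 수도입니다."])
print(util.cos_sim(embeddings[0], embeddings[1]))
```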
The following example uses the sentence-transformers utility functions to compute the pairwise similarity of several sentences:
```python
from sentence_transformers import util

sentences = [
    "대한민국의 수도는 서울입니다.",
    "미국의 수도는 뉴욕이 아닙니다.",
    "대한민국의 수도 요금은 저렴한 편입니다.",
    "서울은 대한민국의 수도입니다.",
    "오늘 서울은 하루종일 맑음",
]

paraphrase = util.paraphrase_mining(model, sentences)

for score, i, j in paraphrase:
    print(f"{sentences[i]}\t\t{sentences[j]}\t\t{score:.4f}")
```
```
대한민국의 수도는 서울입니다.		서울은 대한민국의 수도입니다.		0.7616
대한민국의 수도는 서울입니다.		미국의 수도는 뉴욕이 아닙니다.		0.7031
대한민국의 수도는 서울입니다.		대한민국의 수도 요금은 저렴한 편입니다.		0.6594
미국의 수도는 뉴욕이 아닙니다.		서울은 대한민국의 수도입니다.		0.6445
대한민국의 수도 요금은 저렴한 편입니다.		서울은 대한민국의 수도입니다.		0.4915
미국의 수도는 뉴욕이 아닙니다.		대한민국의 수도 요금은 저렴한 편입니다.		0.4785
서울은 대한민국의 수도입니다.		오늘 서울은 하루종일 맑음		0.4119
대한민국의 수도는 서울입니다.		오늘 서울은 하루종일 맑음		0.3520
미국의 수도는 뉴욕이 아닙니다.		오늘 서울은 하루종일 맑음		0.2550
대한민국의 수도 요금은 저렴한 편입니다.		오늘 서울은 하루종일 맑음		0.1896
```
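`paraphrase_mining` returns `(score, i, j)` triples sorted from most to least similar, which is what the loop above prints. The same embeddings also support semantic search. Below is a minimal sketch using `util.semantic_search`; the corpus and the query string are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')

corpus = [
    "대한민국의 수도는 서울입니다.",
    "미국의 수도는 뉴욕이 아닙니다.",
    "오늘 서울은 하루종일 맑음",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Hypothetical query: "Which country's capital is Seoul?"
query_embedding = model.encode("서울은 어느 나라의 수도인가요?", convert_to_tensor=True)

# Each hit is a dict with 'corpus_id' and 'score', ranked by similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit['corpus_id']], f"{hit['score']:.4f}")
```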
Usage without Sentence-Transformers
Without sentence-transformers, you can use the model by passing your input through the transformer and then applying CLS pooling to the token embeddings:
```python
from transformers import AutoTokenizer, AutoModel
import torch


def cls_pooling(model_output, attention_mask):
    # Use the embedding of the first ([CLS]) token as the sentence embedding.
    return model_output[0][:, 0]


sentences = ['This is an example sentence', 'Each sentence is converted']

tokenizer = AutoTokenizer.from_pretrained('smartmind/roberta-ko-small-tsdae')
model = AutoModel.from_pretrained('smartmind/roberta-ko-small-tsdae')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
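Building on the snippet above, the embeddings can be compared in plain PyTorch by L2-normalizing them and taking dot products, which then equal cosine similarities (a minimal sketch):

```python
import torch.nn.functional as F

# After L2 normalization, dot products equal cosine similarities.
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized @ normalized.T)  # 2x2 matrix with ones on the diagonal
```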
📚 Documentation
Evaluation Results
The following scores were measured on the KLUE STS dataset, without any fine-tuning on it.
| Split | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|-------|----------------|-----------------|-------------------|--------------------|-------------------|--------------------|-------------|--------------|
| Train | 0.8735 | 0.8676 | 0.8268 | 0.8357 | 0.8248 | 0.8336 | 0.8449 | 0.8383 |
| Validation | 0.5409 | 0.5349 | 0.4786 | 0.4657 | 0.4775 | 0.4625 | 0.5284 | 0.5252 |
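These are the metrics reported by sentence-transformers' `EmbeddingSimilarityEvaluator`, so comparable numbers can be computed directly. A sketch, assuming the KLUE STS data is loaded from the Hugging Face Hub with the `datasets` library and its 0–5 similarity labels are rescaled to [0, 1] (field names follow the `klue` dataset config):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')

data = load_dataset("klue", "sts", split="validation")

# KLUE STS labels range from 0 to 5; the evaluator expects scores in [0, 1].
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=[ex["sentence1"] for ex in data],
    sentences2=[ex["sentence2"] for ex in data],
    scores=[ex["labels"]["label"] / 5.0 for ex in data],
)
print(evaluator(model))  # return type (float or dict) varies by library version
```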
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 508, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 256, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
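The pooling config above confirms that sentence embeddings come from the CLS token (matching the `cls_pooling` function earlier) and that inputs are truncated at 508 tokens. If needed, the limit can be inspected or lowered via the `max_seq_length` attribute; a short sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')
print(model.max_seq_length)  # 508

# Lowering the limit speeds up encoding at the cost of truncating long inputs.
model.max_seq_length = 128
```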
📄 License
This project is licensed under the MIT license.