albert-small-kor-sbert-v1
This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space, suitable for tasks such as clustering and semantic search. It was built from the albert-small-kor-v1 model using the SentenceBERT (SBERT) approach.
Quick Start
Installation
You can install the necessary library with the following command:
pip install -U sentence-transformers
Usage Examples
Basic Usage
from sentence_transformers import SentenceTransformer

# Sentences to embed.
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from the Hugging Face Hub and encode the sentences.
model = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')
embeddings = model.encode(sentences)
print(embeddings)
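Because all embeddings share one 768-dimensional space, semantic search reduces to a nearest-neighbor lookup over cosine similarity. A minimal sketch using sentence-transformers' util.cos_sim helper (the corpus and query below are illustrative; Korean sentences work the same way):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')

# Illustrative corpus and query, not from the model's training data.
corpus = ["This is an example sentence", "Each sentence is converted"]
query = "An example sentence"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus sentence.
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = int(scores.argmax())
print(f"Best match: {corpus[best]} (score={float(scores[best]):.4f})")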
Advanced Usage
Without sentence-transformers, you can use the model as follows. First, pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
# CLS pooling: use the embedding of the first ([CLS]) token as the sentence embedding.
# attention_mask is unused for CLS pooling but kept for a uniform pooling interface.
def cls_pooling(model_output, attention_mask):
    return model_output[0][:, 0]
sentences = ['This is an example sentence', 'Each sentence is converted']
tokenizer = AutoTokenizer.from_pretrained('bongsoo/albert-small-kor-sbert-v1')
model = AutoModel.from_pretrained('bongsoo/albert-small-kor-sbert-v1')
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings (no gradients needed for inference).
with torch.no_grad():
    model_output = model(**encoded_input)

# Apply CLS pooling to get one vector per sentence.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
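As a sanity check, the manual CLS-pooling path above should closely match what the sentence-transformers pipeline produces, since the model is configured for CLS pooling (see Full Model Architecture below). A hedged comparison, reusing sentences and sentence_embeddings from the previous snippet:

from sentence_transformers import SentenceTransformer
import numpy as np

st_model = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')
st_embeddings = st_model.encode(sentences)

# Small numerical differences are expected; the tolerance here is an assumption.
print(np.allclose(st_embeddings, sentence_embeddings.cpu().numpy(), atol=1e-4))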
Documentation
Evaluation Results
- For performance measurement, the following Korean (kor) and English (en) evaluation corpora were used:
  - Korean: korsts (1,379 sentence pairs) and klue-sts (519 sentence pairs)
  - English: stsb_multi_mt (1,376 sentence pairs) and glue:stsb (1,500 sentence pairs)
- The performance metric is the cosine Spearman correlation (cosine.spearman); a sketch of how it is computed follows the table below.
- Refer to the evaluation code here.
| Model | korsts | klue-sts | glue(stsb) | stsb_multi_mt(en) |
|-------|--------|----------|------------|-------------------|
| distiluse-base-multilingual-cased-v2 | 0.7475 | 0.7855 | 0.8193 | 0.8075 |
| paraphrase-multilingual-mpnet-base-v2 | 0.8201 | 0.7993 | 0.8907 | 0.8682 |
| bongsoo/moco-sentencedistilbertV2.1 | 0.8390 | 0.8767 | 0.8805 | 0.8548 |
| bongsoo/albert-small-kor-sbert-v1 | 0.8305 | 0.8588 | 0.8419 | 0.7965 |
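For reference, the cosine Spearman metric is the Spearman rank correlation between the cosine similarities of the embedded sentence pairs and the gold similarity labels. A minimal sketch with scipy (the sentence pairs and gold scores below are placeholders, not drawn from the evaluation corpora):

from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')

# Placeholder STS-style pairs with gold similarity labels.
pairs = [
    ("A man is playing a guitar", "A person plays an instrument"),
    ("A woman is cooking", "Someone prepares food"),
    ("A dog runs in the park", "The stock market fell today"),
]
gold = [4.0, 3.5, 0.2]

emb1 = model.encode([a for a, _ in pairs], convert_to_tensor=True)
emb2 = model.encode([b for _, b in pairs], convert_to_tensor=True)

# Cosine similarity of each aligned pair, then rank correlation with the gold scores.
cosine_scores = util.cos_sim(emb1, emb2).diagonal().cpu().numpy()
print(spearmanr(cosine_scores, gold).correlation)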
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
Training
The albert-small-kor-v1 model was trained in four sequential stages, sts(10)-distil(10)-nli(3)-sts(10): STS fine-tuning (10 epochs), distillation (10 epochs), NLI training (3 epochs), and a final STS pass (10 epochs).
Common Parameters
- do_lower_case=1, correct_bios=0, pooling_mode=cls
1. STS
- Corpus: korsts (5,749) + kluestsV1.1 (11,668) + stsb_multi_mt (5,749) + mteb/sickr-sts (9,927) + glue stsb (5,749) (Total: 38,842)
- Parameters: lr: 1e-4, eps: 1e-6, warm_step=10%, epochs: 10, train_batch: 32, eval_batch: 64, max_token_len: 72
- Refer to the training code here; a sketch of this stage is shown below.
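The linked training code is not reproduced in this card, but an STS stage with these parameters typically looks like the following sentence-transformers sketch. The training examples are placeholders, the model is loaded from the released checkpoint for illustration (the actual run started from albert-small-kor-v1), and warmup_steps stands in for the warm_step=10% listed above:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')

# Placeholder STS examples; the real stage used korsts, klue-sts, etc. (see above).
train_examples = [
    InputExample(texts=["A man is eating", "A person eats"], label=0.9),
    InputExample(texts=["A dog barks", "The sky is blue"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Regress the cosine similarity of the two embeddings onto the gold score.
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    warmup_steps=100,  # assumption: computed as 10% of total steps in the real run
    optimizer_params={"lr": 1e-4, "eps": 1e-6},
)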
2. Distillation
- Teacher model: paraphrase-multilingual-mpnet-base-v2 (max_token_len: 128)
- Corpus: news_talk_en_ko_train.tsv (English-Korean dialogue-news parallel corpus: 1.38M)
- Parameters: lr: 5e-5, eps: 1e-8, epochs: 10, train_batch: 32, eval/test_batch: 64, max_token_len: 128 (to match the teacher model)
- Refer to the training code here; a sketch of this stage is shown below.
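Cross-lingual distillation of this kind is commonly implemented with sentence-transformers' MSELoss: the student is trained so that both sides of a parallel pair map onto the teacher's embedding of the source sentence. A hedged sketch (the parallel pair is a placeholder for news_talk_en_ko_train.tsv, and the student is loaded from the released checkpoint for illustration):

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

teacher = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
student = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')

# Placeholder English-Korean parallel pair.
en_ko_pairs = [("This is a news sentence", "이것은 뉴스 문장입니다")]

train_examples = []
for en, ko in en_ko_pairs:
    target = teacher.encode(en)  # the teacher embedding is the regression target
    train_examples.append(InputExample(texts=[en], label=target))
    train_examples.append(InputExample(texts=[ko], label=target))

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MSELoss(model=student)

student.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    optimizer_params={"lr": 5e-5, "eps": 1e-8},
)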
3. NLI
- Corpus: Training (967,852): kornli (550,152), kluenli (24,998), glue-mnli (392,702); Evaluation (3,519): korsts (1,500), kluests (519), gluests (1,500)
- Parameters: lr: 3e-5, eps: 1e-8, warm_step=10%, epochs: 3, train/eval_batch: 64, max_token_len: 128
- Refer to the training code here; a sketch of this stage is shown below.
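The card does not state which loss the NLI stage used; the original SentenceBERT recipe trains a 3-way classifier over entailment/contradiction/neutral pairs with SoftmaxLoss, sketched below with placeholder examples (again loading the released checkpoint for illustration):

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')

# Placeholder NLI examples; labels: 0=contradiction, 1=entailment, 2=neutral.
train_examples = [
    InputExample(texts=["A man is eating", "A person eats"], label=1),
    InputExample(texts=["A man is eating", "Nobody is eating"], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

# Classify the relation from the concatenated pair of sentence embeddings.
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    optimizer_params={"lr": 3e-5, "eps": 1e-8},
)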
Technical Details
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 256, 'do_lower_case': True}) with Transformer model: AlbertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
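The same module stack can be assembled by hand from sentence-transformers building blocks, which makes the CLS-pooling configuration explicit. Loading the model by name already gives you this pipeline, so the sketch below is purely illustrative:

from sentence_transformers import SentenceTransformer, models

# ALBERT backbone with the max_seq_length shown above.
word_embedding_model = models.Transformer('bongsoo/albert-small-kor-sbert-v1',
                                          max_seq_length=256)

# CLS-token pooling over the 768-dimensional contextual embeddings.
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode='cls')

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])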
License
No license information provided.
Citing & Authors
bongsoo