# kpf-sbert-v1.1
This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.
This model is fine-tuned from the jinmang2/kpfbert model using SentenceBERT. (One more round of NLI-STS training was conducted on kpf-sbert-v1.)
## Quick Start
This model can be used directly for tasks like clustering or semantic search by mapping sentences and paragraphs into a 768-dimensional dense vector space.
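A minimal usage sketch with the sentence-transformers library (the example sentences are placeholders):

```python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer("bongsoo/kpf-sbert-v1.1")

sentences = ["아침에 커피를 마셨다.", "A cup of coffee in the morning."]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768)
```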
## Features
- Maps sentences and paragraphs to a 768-dimensional dense vector space.
- Suitable for tasks such as clustering and semantic search (see the sketch below).
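As an illustration of the semantic-search use case, a short sketch using the stock `util.semantic_search` helper (the corpus and query are made-up placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bongsoo/kpf-sbert-v1.1")

corpus = [
    "한국은행이 기준금리를 동결했다.",
    "프로야구 개막전이 이번 주말에 열린다.",
    "중앙은행이 금리 인상을 검토 중이다.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("기준금리 결정 소식", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])
```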
## Documentation
### Evaluation Results
- For performance measurement, the following Korean (kor) and English (en) evaluation corpora were used:
  - Korean: korsts (1,379 sentence pairs) and klue-sts (519 sentence pairs)
  - English: stsb_multi_mt (1,376 sentence pairs) and glue:stsb (1,500 sentence pairs)
- The performance metric is the Spearman rank correlation of cosine similarities (cosin.spearman; a sketch of the computation follows the table).
- Refer to the evaluation measurement code [here](https://github.com/kobongsoo/BERT/blob/master/sbert/sbert-test3.ipynb).
| Model | korsts | klue-sts | glue(stsb) | stsb_multi_mt(en) |
|:---|:---:|:---:|:---:|:---:|
| distiluse-base-multilingual-cased-v2 | 0.7475 | 0.7855 | 0.8193 | 0.8075 |
| paraphrase-multilingual-mpnet-base-v2 | 0.8201 | 0.7993 | 0.8907 | 0.8682 |
| bongsoo/albert-small-kor-sbert-v1 | 0.8305 | 0.8588 | 0.8419 | 0.7965 |
| bongsoo/klue-sbert-v1.0 | 0.8529 | 0.8952 | 0.8813 | 0.8469 |
| bongsoo/kpf-sbert-v1.0 | 0.8590 | 0.8924 | 0.8840 | 0.8531 |
| bongsoo/kpf-sbert-v1.1 | 0.8750 | 0.8900 | 0.8863 | 0.8554 |
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
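The metric above can be reproduced directly; a minimal sketch, assuming STS-style pairs with gold similarity ratings (the pairs and scores below are placeholders, not the actual benchmark data):

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bongsoo/kpf-sbert-v1.1")

# Placeholder STS pairs; the real evaluation uses korsts, klue-sts, etc.
sentences1 = ["고양이가 소파 위에서 잔다.", "오늘 주가가 크게 올랐다."]
sentences2 = ["소파에서 고양이가 자고 있다.", "시장이 급등세를 보였다."]
gold_scores = [4.8, 3.5]  # human similarity ratings

emb1 = model.encode(sentences1, convert_to_tensor=True)
emb2 = model.encode(sentences2, convert_to_tensor=True)

# Cosine similarity of each pair, then Spearman correlation against the gold scores
cosine_scores = util.cos_sim(emb1, emb2).diagonal().cpu().numpy()
print(spearmanr(cosine_scores, gold_scores).correlation)
```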
### Training
- Starting from the jinmang2/kpfbert model, training proceeded through the following stages (epochs in parentheses): sts(10) → distil(10) → nli(3) → sts(10) → nli(3) → sts(10).
The model was trained with the following parameters:
Common
- do_lower_case = 1, correct_bias = 0, pooling_mode = mean
1. STS
- Corpus: korsts (5,749) + kluestsV1.1 (11,668) + stsb_multi_mt (5,749) + mteb/sickr-sts (9,927) + glue stsb (5,749) (total: 38,842 pairs)
- Parameters: lr: 1e-4, eps: 1e-6, warm_step = 10%, epochs: 10, train_batch: 128, eval_batch: 64, max_token_len: 72
- Refer to the training code [here](https://github.com/kobongsoo/BERT/blob/master/sbert/sentece-bert-sts.ipynb).
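A minimal sketch of this stage with the sentence-transformers fit API, assuming mean pooling on top of the base encoder (dataset loading is elided; the two example pairs are placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Base encoder + mean pooling (pooling_mode = mean, max_token_len = 72)
word_embedding = models.Transformer("jinmang2/kpfbert", max_seq_length=72)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding, pooling])

# STS pairs with gold scores normalized to [0, 1]; placeholders for the real corpora
train_examples = [
    InputExample(texts=["고양이가 잔다.", "고양이가 자고 있다."], label=0.96),
    InputExample(texts=["고양이가 잔다.", "주가가 올랐다."], label=0.05),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=128)
train_loss = losses.CosineSimilarityLoss(model)

num_epochs = 10
warmup_steps = int(len(train_loader) * num_epochs * 0.1)  # warm_step = 10%
model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    optimizer_params={"lr": 1e-4, "eps": 1e-6},
)
```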
2. Distillation
- Teacher model: paraphrase-multilingual-mpnet-base-v2 (max_token_len: 128)
- Corpus: news_talk_en_ko_train.tsv (English-Korean dialogue-news parallel corpus: 1.38M pairs)
- Parameters: lr: 5e-5, eps: 1e-8, epochs: 10, train_batch: 128, eval/test_batch: 64, max_token_len: 128 (to match the teacher model)
- Refer to the training code [here](https://github.com/kobongsoo/BERT/blob/master/sbert/sbert-distillaton.ipynb).
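A sketch of this stage along the lines of the standard sentence-transformers multilingual-distillation recipe (MSE between student and teacher embeddings); whether the linked notebook uses exactly these helpers, and which checkpoint serves as the student, are assumptions:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import ParallelSentencesDataset

teacher = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
student = SentenceTransformer("bongsoo/kpf-sbert-v1.0")  # model from the STS stage (assumption)
student.max_seq_length = 128  # match the teacher's max_token_len

# Each TSV line holds an English sentence and its Korean translation
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data("news_talk_en_ko_train.tsv")
train_loader = DataLoader(train_data, shuffle=True, batch_size=128)

# Student embeddings are regressed onto the teacher's embeddings
train_loss = losses.MSELoss(model=student)
student.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=10,
    optimizer_params={"lr": 5e-5, "eps": 1e-8},
)
```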
3. NLI
- Corpus: training (967,852 pairs): kornli (550,152), kluenli (24,998), glue-mnli (392,702); evaluation (3,519 pairs): korsts (1,500), kluests (519), gluests (1,500)
- Parameters: lr: 3e-5, eps: 1e-8, warm_step = 10%, epochs: 3, train/eval_batch: 64, max_token_len: 128
- Refer to the training code [here](https://github.com/kobongsoo/BERT/blob/master/sbert/sentence-bert-nli.ipynb).
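A sketch of this stage using the classic SBERT NLI setup with a softmax classification head; the exact loss used in the linked notebook is not stated here, so SoftmaxLoss is an assumption:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("bongsoo/kpf-sbert-v1.0")  # model from the previous stage (assumption)

# NLI pairs labeled contradiction / entailment / neutral; placeholders for kornli etc.
label2id = {"contradiction": 0, "entailment": 1, "neutral": 2}
train_examples = [
    InputExample(texts=["남자가 밥을 먹는다.", "남자가 식사 중이다."], label=label2id["entailment"]),
    InputExample(texts=["남자가 밥을 먹는다.", "남자가 잠을 잔다."], label=label2id["contradiction"]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=len(label2id),
)

num_epochs = 3
warmup_steps = int(len(train_loader) * num_epochs * 0.1)  # warm_step = 10%
model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    optimizer_params={"lr": 3e-5, "eps": 1e-8},
)
```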
## License
No license information provided.
## Citing & Authors
bongsoo