KURE V1
KURE-v1 is an embedding model optimized for Korean text retrieval. It is fine-tuned from BAAI/bge-m3 and excels at Korean retrieval tasks.
Downloads 27.44k
Release Time: 12/18/2024
Model Overview
This model performs exceptionally well in Korean text retrieval and is one of the best publicly available Korean retrieval models. It supports both Korean and English, making it suitable for information retrieval and similarity calculation tasks.
Model Features
Optimized Korean retrieval performance
Specially optimized for Korean text retrieval tasks, significantly outperforming most multilingual embedding models
Large sequence length support
Supports sequence lengths of up to 8192 tokens, making it suitable for long-document retrieval (see the snippet after this feature list for how to check and adjust this limit)
Efficient training method
Trained using cached GIST embedding loss with a batch size of up to 4096, ensuring high training efficiency
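As a quick check of the figures above, the embedding size and input limit can be read (and, if needed, lowered) through standard Sentence Transformers attributes. This is a minimal sketch, not part of the original card:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nlpai-lab/KURE-v1")

# Values reported on this card: 1024-dimensional embeddings, 8192-token inputs.
print(model.get_sentence_embedding_dimension())  # embedding dimension
print(model.max_seq_length)                      # maximum input length in tokens

# For corpora of short passages, lowering the limit can speed up encoding.
model.max_seq_length = 512
```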
Model Capabilities
Korean text embedding
Cross-language retrieval (Korean-English)
Long document processing
Sentence similarity calculation
Use Cases
Information retrieval
Korean document retrieval system
Build an efficient Korean search engine to quickly retrieve relevant documents
Performs excellently on multiple Korean retrieval benchmarks
Question answering systems
Korean open-domain QA
Used as the document retrieval component in question-answering systems
Performs well on datasets such as Ko-StrategyQA
KURE-v1
Introducing the Korea University Retrieval Embedding model, KURE-v1. It has demonstrated remarkable performance in Korean text retrieval, specifically outperforming most multilingual embedding models. To our knowledge, it is one of the best publicly available Korean retrieval models.
For details, visit the KURE repository
Quick Start
The KURE-v1 model offers excellent performance in Korean text retrieval. You can get started quickly by following the steps below.
Features
- High Performance: It has shown remarkable performance in Korean text retrieval, outperforming most multilingual embedding models.
- Publicly Available: It is one of the best publicly available Korean retrieval models.
Installation
Install Dependencies
First, install the Sentence Transformers library:
pip install -U sentence-transformers
Usage Examples
Basic Usage
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("nlpai-lab/KURE-v1")
# Run inference
sentences = [
    # Query
    '헌법과 법원조직법은 어떤 방식을 통해 기본권 보장 등의 다양한 법적 모색을 가능하게 했어',
    # The original card pairs this query with two long Korean legal passages (one on
    # diversifying the composition of the Supreme Court, one on a 2001 German Federal
    # Constitutional Court ruling on courtroom broadcasting). Those passages were garbled
    # in this copy, so shortened stand-ins are used here.
    '우리 헌법과 「법원조직법」은 대법원 구성을 다양화하여 기본권 보장과 민주주의 확립에 있어 다각적인 법적 모색을 가능하게 하는 것을 근본 규범으로 하고 있다.',
    '연방헌법재판소는 2001년 1월 24일 5:3의 다수견해로 「법원조직법」 제169조 제2문이 헌법에 합치된다는 판결을 내렸음.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# Similarity scores reported on the original card (computed with the full-length passages)
# tensor([[1.0000, 0.6967, 0.5306],
# [0.6967, 1.0000, 0.4427],
# [0.5306, 0.4427, 1.0000]])
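Beyond pairwise similarity, the same embeddings can drive a small retrieval loop. The sketch below uses the generic `sentence_transformers.util.semantic_search` helper with made-up Korean and English passages (none of this corpus comes from the original card); it also illustrates the Korean-English cross-lingual capability mentioned above.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nlpai-lab/KURE-v1")

# Illustrative corpus (placeholder sentences, not from the original card).
corpus = [
    "세종대왕은 훈민정음을 창제하여 한글의 기초를 마련했다.",
    "대한민국의 수도는 서울이다.",
    "The Bank of Korea sets the base interest rate.",
]
query = "한글을 만든 사람은 누구인가?"

corpus_embeddings = model.encode(corpus)
query_embedding = model.encode(query)

# Rank corpus passages by cosine similarity to the query and keep the top 2.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")
```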
Documentation
Model Versions
Property | Details |
---|---|
Model Name | KURE-v1, KoE5 |
Dimension | 1024 |
Sequence Length | 8192 for KURE-v1, 512 for KoE5 |
Introduction | KURE-v1 is BAAI/bge-m3 fine-tuned on Korean data via CachedGISTEmbedLoss. KoE5 is intfloat/multilingual-e5-large fine-tuned on ko-triplet-v1.0 via CachedMultipleNegativesRankingLoss. |
Model Description
- Developed by: NLP&AI Lab
- Language(s) (NLP): Korean, English
- License: MIT
- Finetuned from model: BAAI/bge-m3
Training Details
Training Data
- KURE-v1: Korean query-document pairs with 5 hard negatives per query, 2,000,000 examples
Training Procedure
- Loss: CachedGISTEmbedLoss from sentence-transformers (a minimal training sketch follows this list)
- Batch Size: 4096
- Learning Rate: 2e-05
- Epochs: 1
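The loss and hyperparameters above map onto the Sentence Transformers v3 Trainer API roughly as follows. This is a hedged sketch with a two-example toy dataset: the actual 2M-example training data, the guide model, and the exact trainer configuration used for KURE-v1 are not specified on this card, so treat every concrete choice below (guide model, mini_batch_size, toy batch size) as an assumption.

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedGISTEmbedLoss

# Toy (anchor, positive) pairs; the real data is query-document pairs with hard negatives.
train_dataset = Dataset.from_dict({
    "anchor": ["대한민국의 수도는 어디인가?", "훈민정음을 만든 사람은?"],
    "positive": ["대한민국의 수도는 서울이다.", "세종대왕이 훈민정음을 창제했다."],
})

model = SentenceTransformer("BAAI/bge-m3")   # base model per this card
guide = SentenceTransformer("BAAI/bge-m3")   # guide model for GIST filtering (assumed choice)

# CachedGISTEmbedLoss processes the batch in cached mini-batches, which is what
# makes very large effective batch sizes (the card reports 4096) fit in memory.
loss = CachedGISTEmbedLoss(model, guide=guide, mini_batch_size=16)

args = SentenceTransformerTrainingArguments(
    output_dir="kure-style-sketch",
    per_device_train_batch_size=2,   # toy value; the card reports a batch size of 4096
    learning_rate=2e-5,
    num_train_epochs=1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```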
Evaluation
Metrics
- Recall, Precision, NDCG, F1
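For reference, these four metrics have standard definitions; the sketch below shows one common way to compute them at a cutoff k for a single query with binary relevance. It is illustrative only and is not the evaluation code from the KURE repository.

```python
import math

def retrieval_metrics_at_k(retrieved: list[str], relevant: set[str], k: int) -> dict[str, float]:
    """Recall@k, Precision@k, NDCG@k, and F1@k for one query (binary relevance)."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)

    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / k
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    # DCG uses log2(rank + 1) discounting; IDCG assumes all relevant docs are ranked first.
    dcg = sum(1.0 / math.log2(rank + 2) for rank, doc_id in enumerate(top_k) if doc_id in relevant)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg else 0.0

    return {"recall": recall, "precision": precision, "ndcg": ndcg, "f1": f1}

# Example: the model returned [d3, d1, d7]; only d1 and d2 are relevant.
print(retrieval_metrics_at_k(["d3", "d1", "d7"], {"d1", "d2"}, k=3))
```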
Benchmark Datasets
- Ko-StrategyQA: Korean ODQA multi-hop retrieval dataset (translation of StrategyQA)
- AutoRAGRetrieval: Korean document retrieval dataset constructed by parsing PDFs from five domains: finance, public sector, healthcare, law, and commerce
- MIRACLRetrieval: Korean document retrieval dataset based on Wikipedia
- PublicHealthQA: Korean document retrieval dataset in the medical and public health domain
- BelebeleRetrieval: Korean document retrieval dataset based on FLORES-200
- MrTidyRetrieval: Korean document retrieval dataset based on Wikipedia
- MultiLongDocRetrieval: Korean long document retrieval dataset in various domains
- XPQARetrieval: Korean document retrieval dataset in various domains
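Most of the datasets listed above are also available as retrieval tasks in the open-source mteb package, so a comparable evaluation can be run in a few lines. This is a sketch under the assumption that the task names below match the mteb registry; the numbers reported in this card come from the KURE repository's own evaluation scripts, not from this snippet.

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nlpai-lab/KURE-v1")

# Task names assumed to mirror the benchmark list above.
tasks = mteb.get_tasks(tasks=["Ko-StrategyQA", "AutoRAGRetrieval", "PublicHealthQA"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/KURE-v1")
```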
Results
The following are the average results of all models across all benchmark datasets. For detailed results, please visit the KURE GitHub repository.
Top-k 1
Model | Average Recall_top1 | Average Precision_top1 | Average NDCG_top1 | Average F1_top1 |
---|---|---|---|---|
nlpai-lab/KURE-v1 | 0.52640 | 0.60551 | 0.60551 | 0.55784 |
dragonkue/BGE-m3-ko | 0.52361 | 0.60394 | 0.60394 | 0.55535 |
BAAI/bge-m3 | 0.51778 | 0.59846 | 0.59846 | 0.54998 |
Snowflake/snowflake-arctic-embed-l-v2.0 | 0.51246 | 0.59384 | 0.59384 | 0.54489 |
nlpai-lab/KoE5 | 0.50157 | 0.57790 | 0.57790 | 0.53178 |
intfloat/multilingual-e5-large | 0.50052 | 0.57727 | 0.57727 | 0.53122 |
jinaai/jina-embeddings-v3 | 0.48287 | 0.56068 | 0.56068 | 0.51361 |
BAAI/bge-multilingual-gemma2 | 0.47904 | 0.55472 | 0.55472 | 0.50916 |
intfloat/multilingual-e5-large-instruct | 0.47842 | 0.55435 | 0.55435 | 0.50826 |
intfloat/multilingual-e5-base | 0.46950 | 0.54490 | 0.54490 | 0.49947 |
intfloat/e5-mistral-7b-instruct | 0.46772 | 0.54394 | 0.54394 | 0.49781 |
Alibaba-NLP/gte-multilingual-base | 0.46469 | 0.53744 | 0.53744 | 0.49353 |
Alibaba-NLP/gte-Qwen2-7B-instruct | 0.46633 | 0.53625 | 0.53625 | 0.49429 |
openai/text-embedding-3-large | 0.44884 | 0.51688 | 0.51688 | 0.47572 |
Salesforce/SFR-Embedding-2_R | 0.43748 | 0.50815 | 0.50815 | 0.46504 |
upskyy/bge-m3-korean | 0.43125 | 0.50245 | 0.50245 | 0.45945 |
jhgan/ko-sroberta-multitask | 0.33788 | 0.38497 | 0.38497 | 0.35678 |
Top-k 3
Model | Average Recall_top3 | Average Precision_top3 | Average NDCG_top3 | Average F1_top3 |
---|---|---|---|---|
nlpai-lab/KURE-v1 | 0.68678 | 0.28711 | 0.65538 | 0.39835 |
dragonkue/BGE-m3-ko | 0.67834 | 0.28385 | 0.64950 | 0.39378 |
BAAI/bge-m3 | 0.67526 | 0.28374 | 0.64556 | 0.39291 |
Snowflake/snowflake-arctic-embed-l-v2.0 | 0.67128 | 0.28193 | 0.64042 | 0.39072 |
intfloat/multilingual-e5-large | 0.65807 | 0.27777 | 0.62822 | 0.38423 |
nlpai-lab/KoE5 | 0.65174 | 0.27329 | 0.62369 | 0.37882 |
BAAI/bge-multilingual-gemma2 | 0.64415 | 0.27416 | 0.61105 | 0.37782 |
jinaai/jina-embeddings-v3 | 0.64116 | 0.27165 | 0.60954 | 0.37511 |
intfloat/multilingual-e5-large-instruct | 0.64353 | 0.27040 | 0.60790 | 0.37453 |
Alibaba-NLP/gte-multilingual-base | 0.63744 | 0.26404 | 0.59695 | 0.36764 |
Alibaba-NLP/gte-Qwen2-7B-instruct | 0.63163 | 0.25937 | 0.59237 | 0.36263 |
intfloat/multilingual-e5-base | 0.62099 | 0.26144 | 0.59179 | 0.36203 |
intfloat/e5-mistral-7b-instruct | 0.62087 | 0.26144 | 0.58917 | 0.36188 |
openai/text-embedding-3-large | 0.61035 | 0.25356 | 0.57329 | 0.35270 |
Salesforce/SFR-Embedding-2_R | 0.60001 | 0.25253 | 0.56346 | 0.34952 |
upskyy/bge-m3-korean | 0.59215 | 0.25076 | 0.55722 | 0.34623 |
jhgan/ko-sroberta-multitask | 0.46930 | 0.18994 | 0.43293 | 0.26696 |
Top-k 5
Model | Average Recall_top5 | Average Precision_top5 | Average NDCG_top5 | Average F1_top5 |
---|---|---|---|---|
nlpai-lab/KURE-v1 | 0.73851 | 0.19130 | 0.67479 | 0.29903 |
dragonkue/BGE-m3-ko | 0.72517 | 0.18799 | 0.66692 | 0.29401 |
BAAI/bge-m3 | 0.72954 | 0.18975 | 0.66615 | 0.29632 |
Snowflake/snowflake-arctic-embed-l-v2.0 | 0.72962 | 0.18875 | 0.66236 | 0.29542 |
nlpai-lab/KoE5 | 0.70820 | 0.18287 | 0.64499 | 0.28628 |
intfloat/multilingual-e5-large | 0.70124 | 0.18316 | 0.64402 | 0.28588 |
BAAI/bge-multilingual-gemma2 | 0.70258 | 0.18556 | 0.63338 | 0.28851 |
jinaai/jina-embeddings-v3 | 0.69933 | 0.18256 | 0.63133 | 0.28505 |
intfloat/multilingual-e5-large-instruct | 0.69018 | 0.17838 | 0.62486 | 0.27933 |
Alibaba-NLP/gte-multilingual-base | 0.69365 | 0.17789 | 0.61896 | 0.27879 |
intfloat/multilingual-e5-base | 0.67250 | 0.17406 | 0.61119 | 0.27247 |
Alibaba-NLP/gte-Qwen2-7B-instruct | 0.67447 | 0.17114 | 0.60952 | 0.26943 |
intfloat/e5-mistral-7b-instruct | 0.67449 | 0.17484 | 0.60935 | 0.27349 |
openai/text-embedding-3-large | 0.66365 | 0.17004 | 0.59389 | 0.26677 |
Salesforce/SFR-Embedding-2_R | 0.65622 | 0.17018 | 0.58494 | 0.26612 |
upskyy/bge-m3-korean | 0.65477 | 0.17015 | 0.58073 | 0.26589 |
jhgan/ko-sroberta-multitask | 0.53136 | 0.13264 | 0.45879 | 0.20976 |
Top-k 10
Model | Average Recall_top10 | Average Precision_top10 | Average NDCG_top10 | Average F1_top10 |
---|---|---|---|---|
nlpai-lab/KURE-v1 | 0.79682 | 0.10624 | 0.69473 | 0.18524 |
dragonkue/BGE-m3-ko | 0.78450 | 0.10492 | 0.68748 | 0.18288 |
BAAI/bge-m3 | 0.79195 | 0.10592 | 0.68723 | 0.18456 |
Snowflake/snowflake-arctic-embed-l-v2.0 | 0.78669 | 0.10462 | ... | ... |
License
This project is licensed under the MIT license.
Jina Embeddings V3
Jina Embeddings V3 is a multilingual sentence embedding model supporting over 100 languages, specializing in sentence similarity and feature extraction tasks.
Text Embedding
Transformers Supports Multiple Languages
jinaai
3.7M
911
Ms Marco MiniLM L6 V2
Apache-2.0
A cross-encoder model trained on the MS Marco passage ranking task for query-passage relevance scoring in information retrieval
Text Embedding English
cross-encoder
2.5M
86
Opensearch Neural Sparse Encoding Doc V2 Distill
Apache-2.0
A sparse retrieval model based on distillation technology, optimized for OpenSearch, supporting inference-free document encoding with improved search relevance and efficiency over V1
Text Embedding
Transformers English
opensearch-project
1.8M
7
Sapbert From PubMedBERT Fulltext
Apache-2.0
A biomedical entity representation model based on PubMedBERT, optimized for semantic relation capture through self-aligned pre-training
Text Embedding English
cambridgeltl
1.7M
49
Gte Large
MIT
GTE-Large is a powerful sentence transformer model focused on sentence similarity and text embedding tasks, excelling in multiple benchmark tests.
Text Embedding English
thenlper
1.5M
278
Gte Base En V1.5
Apache-2.0
GTE-base-en-v1.5 is an English sentence transformer model focused on sentence similarity tasks, excelling in multiple text embedding benchmarks.
Text Embedding
Transformers Supports Multiple Languages
Alibaba-NLP
1.5M
63
Gte Multilingual Base
Apache-2.0
GTE Multilingual Base is a multilingual sentence embedding model supporting over 50 languages, suitable for tasks like sentence similarity calculation.
Text Embedding
Transformers Supports Multiple Languages
Alibaba-NLP
1.2M
246
Polybert
polyBERT is a chemical language model designed to achieve fully machine-driven ultrafast polymer informatics. It maps PSMILES strings into 600-dimensional dense fingerprints to numerically represent polymer chemical structures.
Text Embedding
Transformers
kuelumbus
1.0M
5
Bert Base Turkish Cased Mean Nli Stsb Tr
Apache-2.0
A sentence embedding model based on Turkish BERT, optimized for semantic similarity tasks
Text Embedding
Transformers Other
emrecan
1.0M
40
GIST Small Embedding V0
MIT
A text embedding model fine-tuned based on BAAI/bge-small-en-v1.5, trained with the MEDI dataset and MTEB classification task datasets, optimized for query encoding in retrieval tasks.
Text Embedding
Safetensors English
avsolatorio
945.68k
29