BGE-Reranker-V2-M3-KO: Open-Source Korean Reranking Model

Bge Reranker V2 M3 Ko

Developed by dragonkue

This is a Korean reranking model optimized based on BAAI/bge-reranker-v2-m3, primarily used for text reranking tasks.

Supports Multiple Languages

Open Source License: Apache-2.0

#Korean optimization #Financial text reranking #High-precision reranking

Downloads 877

Release Time: 10/16/2024

Model Overview

This model is a cross-encoder: it takes a query and a passage together as input and directly outputs a relevance score, which makes it suitable for information retrieval and document reranking tasks.

Model Features

Multilingual support

Supports Korean and English, with special optimization for Korean.

High-precision reranking

Directly computes a similarity score for each text pair, which is typically more accurate than comparing embeddings from a bi-encoder model.

Multiple usage options

Supports usage via Transformers, SentenceTransformers, and FlagEmbedding libraries.

Model Capabilities

Text similarity calculation

Document reranking

Information retrieval

Use Cases

Information retrieval

Financial document retrieval

Used for retrieving Korean financial documents such as legal provisions and policy documents.

Achieved a Top-1 F1 score of 0.9123 in Korean financial domain benchmarks.

Question answering systems

Question-answer matching

Used to calculate the relevance between questions and candidate answers, selecting the best-matching answer.

🚀 Reranker (Cross-Encoder)

Unlike embedding models, a reranker takes a query and a document together as input and directly outputs a similarity score rather than an embedding. The raw score for a query-passage pair is an unbounded logit; applying the sigmoid function maps it to a float in the range [0,1].
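
As a minimal sketch of that mapping, the snippet below applies a sigmoid to two raw scores. The logit values are hypothetical and chosen only for illustration; they are not outputs of this model.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw reranker logit to a relevance score between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


# Hypothetical logits for a relevant and an irrelevant passage.
raw_logits = [10.8, -14.5]
scores = [sigmoid(x) for x in raw_logits]
print(scores)
```

A logit of 0 maps to 0.5, large positive logits approach 1, and large negative logits approach 0, so the ordering of passages is the same before and after normalization.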


🚀 Quick Start

The reranker is a powerful tool for text ranking. It offers a more accurate way to determine the relevance between a query and a passage compared to traditional embedding models.

✨ Features

Direct Similarity Output: Unlike embedding models, it directly outputs similarity scores.
Multilingual Support: Optimized for Korean, with English also supported.

📦 Installation

To use the reranker, you need to install the necessary libraries. Here are the installation commands for different libraries:

Install Sentence Transformers

pip install -U sentence-transformers

Install FlagEmbedding

pip install -U FlagEmbedding

💻 Usage Examples

Basic Usage with Transformers

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('dragonkue/bge-reranker-v2-m3-ko')
tokenizer = AutoTokenizer.from_pretrained('dragonkue/bge-reranker-v2-m3-ko')

features = tokenizer([['몇 년도에 지방세외수입법이 시행됐을까?', '실무교육을 통해 ‘지방세외수입법’에 대한 자치단체의 관심을 제고하고 자치단체의 차질 없는 업무 추진을 지원하였다. 이러한 준비과정을 거쳐 2014년 8월 7일부터 ‘지방세외수입법’이 시행되었다.'], 
['몇 년도에 지방세외수입법이 시행됐을까?', '식품의약품안전처는 21일 국내 제약기업 유바이오로직스가 개발 중인 신종 코로나바이러스 감염증(코로나19) 백신 후보물질 ‘유코백-19’의 임상시험 계획을 지난 20일 승인했다고 밝혔다.']],  padding=True, truncation=True, return_tensors="pt")

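model.eval()
with torch.no_grad():
    # logits has shape (batch_size, 1); flatten to one score per pair
    logits = model(**features).logits.view(-1)
    # sigmoid maps each raw logit into the range [0, 1]
    scores = torch.sigmoid(logits)
    print(scores.numpy())
# [9.9997962e-01 5.0702977e-07]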

Usage with SentenceTransformers

import torch
from sentence_transformers import CrossEncoder

model = CrossEncoder('dragonkue/bge-reranker-v2-m3-ko', default_activation_function=torch.nn.Sigmoid())

scores = model.predict([['몇 년도에 지방세외수입법이 시행됐을까?', '실무교육을 통해 ‘지방세외수입법’에 대한 자치단체의 관심을 제고하고 자치단체의 차질 없는 업무 추진을 지원하였다. 이러한 준비과정을 거쳐 2014년 8월 7일부터 ‘지방세외수입법’이 시행되었다.'], 
['몇 년도에 지방세외수입법이 시행됐을까?', '식품의약품안전처는 21일 국내 제약기업 유바이오로직스가 개발 중인 신종 코로나바이러스 감염증(코로나19) 백신 후보물질 ‘유코백-19’의 임상시험 계획을 지난 20일 승인했다고 밝혔다.']])
print(scores)
# [9.9997962e-01 5.0702977e-07]

Usage with FlagEmbedding

from FlagEmbedding import FlagReranker

reranker = FlagReranker('dragonkue/bge-reranker-v2-m3-ko')

scores = reranker.compute_score([['몇 년도에 지방세외수입법이 시행됐을까?', '실무교육을 통해 ‘지방세외수입법’에 대한 자치단체의 관심을 제고하고 자치단체의 차질 없는 업무 추진을 지원하였다. 이러한 준비과정을 거쳐 2014년 8월 7일부터 ‘지방세외수입법’이 시행되었다.'], 
['몇 년도에 지방세외수입법이 시행됐을까?', '식품의약품안전처는 21일 국내 제약기업 유바이오로직스가 개발 중인 신종 코로나바이러스 감염증(코로나19) 백신 후보물질 ‘유코백-19’의 임상시험 계획을 지난 20일 승인했다고 밝혔다.']], normalize=True)
print(scores)
# [9.9997962e-01 5.0702977e-07]

📚 Documentation

Model Details

Property	Details
Model Type	Reranker (Cross-Encoder)
Base Model	BAAI/bge-reranker-v2-m3
Training Data	Not specified
Optimized for	Korean

Fine-tune

For fine-tuning instructions, please refer to FlagOpen/FlagEmbedding.

Evaluation

Bi-encoder and Cross-encoder

Bi-encoders convert each text into a fixed-size vector and compute similarity between vectors. Because document vectors can be computed once and reused, they are fast and well suited to tasks such as semantic search and classification over large datasets.

Cross-encoders process each query-passage pair jointly to compute a similarity score, which is typically more accurate. Because every pair requires its own forward pass they are slower, so they are usually applied to rerank the top results returned by a bi-encoder, a common step in advanced RAG pipelines.
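
The reranking step described above can be sketched as a small helper that scores every (query, passage) pair and keeps the best ones. The `rerank` function and the `toy_overlap_scorer` stand-in are hypothetical names introduced for this illustration; the toy scorer only counts shared tokens so that the example runs without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[List[List[str]]], Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score each (query, passage) pair and return the top_k passages."""
    pairs = [[query, passage] for passage in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def toy_overlap_scorer(pairs: List[List[str]]) -> List[float]:
    """Stand-in scorer (NOT the model): fraction of query tokens found in the passage."""
    scores = []
    for query, passage in pairs:
        q_tokens = set(query.lower().split())
        p_tokens = set(passage.lower().split())
        scores.append(len(q_tokens & p_tokens) / max(len(q_tokens), 1))
    return scores


# Candidates as they might come back from a first-stage bi-encoder search.
candidates = [
    "The vaccine trial plan was approved on the 20th.",
    "The local non-tax revenue act took effect on August 7, 2014.",
    "Training sessions raised awareness among local governments.",
]
top = rerank(
    "When did the local non-tax revenue act take effect?",
    candidates,
    toy_overlap_scorer,
    top_k=2,
)
for passage, score in top:
    print(f"{score:.3f}  {passage}")
```

To use the actual model, the `predict` method of the `CrossEncoder` shown in the SentenceTransformers usage example could be passed as `score_pairs`, since it accepts the same list of [query, passage] pairs.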

Korean Embedding Benchmark with AutoRAG

Korean Embedding Benchmark for Financial Sector

Top-k 1

Bi-Encoder (Sentence Transformer)

Model name	F1	Recall	Precision
paraphrase-multilingual-mpnet-base-v2	0.3596	0.3596	0.3596
KoSimCSE-roberta	0.4298	0.4298	0.4298
Cohere embed-multilingual-v3.0	0.3596	0.3596	0.3596
openai ada 002	0.4737	0.4737	0.4737
multilingual-e5-large-instruct	0.4649	0.4649	0.4649
Upstage Embedding	0.6579	0.6579	0.6579
paraphrase-multilingual-MiniLM-L12-v2	0.2982	0.2982	0.2982
openai_embed_3_small	0.5439	0.5439	0.5439
ko-sroberta-multitask	0.4211	0.4211	0.4211
openai_embed_3_large	0.6053	0.6053	0.6053
KU-HIAI-ONTHEIT-large-v1	0.7105	0.7105	0.7105
KU-HIAI-ONTHEIT-large-v1.1	0.7193	0.7193	0.7193
kf-deberta-multitask	0.4561	0.4561	0.4561
gte-multilingual-base	0.5877	0.5877	0.5877
KoE5	0.7018	0.7018	0.7018
BGE-m3	0.6578	0.6578	0.6578
bge-m3-korean	0.5351	0.5351	0.5351
BGE-m3-ko	0.7456	0.7456	0.7456

Cross-Encoder (Reranker)

Model name	F1	Recall	Precision
gte-multilingual-reranker-base	0.7281	0.7281	0.7281
jina-reranker-v2-base-multilingual	0.8070	0.8070	0.8070
bge-reranker-v2-m3	0.8772	0.8772	0.8772
upskyy/ko-reranker-8k	0.8684	0.8684	0.8684
upskyy/ko-reranker	0.8333	0.8333	0.8333
mncai/bge-ko-reranker-560M	0.0088	0.0088	0.0088
Dongjin-kr/ko-reranker	0.8509	0.8509	0.8509
bge-reranker-v2-m3-ko	0.9123	0.9123	0.9123

Top-k 3

Bi-Encoder (Sentence Transformer)

Model name	F1	Recall	Precision
paraphrase-multilingual-mpnet-base-v2	0.2368	0.4737	0.1579
KoSimCSE-roberta	0.3026	0.6053	0.2018
Cohere embed-multilingual-v3.0	0.2851	0.5702	0.1901
openai ada 002	0.3553	0.7105	0.2368
multilingual-e5-large-instruct	0.3333	0.6667	0.2222
Upstage Embedding	0.4211	0.8421	0.2807
paraphrase-multilingual-MiniLM-L12-v2	0.2061	0.4123	0.1374
openai_embed_3_small	0.3640	0.7281	0.2427
ko-sroberta-multitask	0.2939	0.5877	0.1959
openai_embed_3_large	0.3947	0.7895	0.2632
KU-HIAI-ONTHEIT-large-v1	0.4386	0.8772	0.2924
KU-HIAI-ONTHEIT-large-v1.1	0.4430	0.8860	0.2953
kf-deberta-multitask	0.3158	0.6316	0.2105
gte-multilingual-base	0.4035	0.8070	0.2690
KoE5	0.4254	0.8509	0.2836
BGE-m3	0.4254	0.8508	0.2836
bge-m3-korean	0.3684	0.7368	0.2456
BGE-m3-ko	0.4517	0.9035	0.3011

Cross-Encoder (Reranker)

Model name	F1	Recall	Precision
gte-multilingual-reranker-base	0.4605	0.9211	0.3070
jina-reranker-v2-base-multilingual	0.4649	0.9298	0.3099
bge-reranker-v2-m3	0.4781	0.9561	0.3187
upskyy/ko-reranker-8k	0.4781	0.9561	0.3187
upskyy/ko-reranker	0.4649	0.9298	0.3099
mncai/bge-ko-reranker-560M	0.0044	0.0088	0.0029
Dongjin-kr/ko-reranker	0.4737	0.9474	0.3158
bge-reranker-v2-m3-ko	0.4825	0.9649	0.3216
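
The relationship between the three columns can be checked with a short calculation. This sketch assumes that each query has exactly one relevant passage; the source does not state this, but it is consistent with the reported numbers (at Top-k 1 all three metrics coincide, and at Top-k 3 precision is one third of recall). The function name is hypothetical.

```python
from typing import Tuple


def precision_and_f1_from_recall(recall: float, k: int) -> Tuple[float, float]:
    """Derive precision and F1 at top-k from recall.

    Assumption: exactly one relevant passage per query, so at most one of the
    k retrieved passages can be correct and precision = recall / k.
    """
    precision = recall / k
    if recall == 0:
        return 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, f1


# Recall values for bge-reranker-v2-m3-ko taken from the tables above.
p1, f1_at_1 = precision_and_f1_from_recall(0.9123, k=1)
p3, f1_at_3 = precision_and_f1_from_recall(0.9649, k=3)
print(f"top-1: precision={p1:.4f}, F1={f1_at_1:.4f}")
print(f"top-3: precision={p3:.4f}, F1={f1_at_3:.4f}")
```

Under this assumption, F1 at top-k simplifies to 2 × recall / (k + 1), which is why F1 drops at Top-k 3 even though recall rises. Recall is therefore the most informative column when comparing models at k greater than 1.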

📄 License

This project is licensed under the Apache-2.0 License.
