🚀 upskyy/ko-reranker
ko-reranker is a model fine-tuned from BAAI/bge-reranker-large on Korean data. It can be used for text reranking tasks, scoring the relevance between a query and a passage.
🚀 Quick Start
The model can be used through several libraries; pick whichever fits your stack.
📦 Installation
Using FlagEmbedding
pip install -U FlagEmbedding
Using Sentence-Transformers
pip install -U sentence-transformers
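Using Hugging Face Transformers
The Transformers example further below also needs torch and transformers:
pip install -U torch transformers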
💻 Usage Examples
Using FlagEmbedding
from FlagEmbedding import FlagReranker

# use_fp16=True speeds up inference at a small cost in precision
reranker = FlagReranker('upskyy/ko-reranker', use_fp16=True)

# Score a single query-passage pair (raw logit; higher means more relevant)
score = reranker.compute_score(['query', 'passage'])
print(score)

# normalize=True maps the raw logit to a 0-1 score via a sigmoid
score = reranker.compute_score(['query', 'passage'], normalize=True)
print(score)

# Score multiple pairs in one call
scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores)

scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']], normalize=True)
print(scores)
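Note that compute_score returns a single float for one pair and a list of scores, in input order, for a list of pairs.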
Using Sentence-Transformers
Because ko-reranker is a cross-encoder rather than an embedding model, load it with the sentence-transformers CrossEncoder class and score query-passage pairs directly:

from sentence_transformers import CrossEncoder

model = CrossEncoder('upskyy/ko-reranker', max_length=512)

# Each inner list is a (query, passage) pair to score
pairs = [
    ["경제 전문가가 금리 인하에 대한 예측을 하고 있다.", "주식 시장에서 한 투자자가 주식을 매수한다."],
    ["한 투자자가 비트코인을 매수한다.", "금융 거래소에서 새로운 디지털 자산이 상장된다."],
]
scores = model.predict(pairs)
print(scores)
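predict returns one relevance score per pair; higher means more relevant. For single-label cross-encoders like this one, sentence-transformers applies a sigmoid by default, so the scores fall in the 0-1 range.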
Using Hugging Face Transformers
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('upskyy/ko-reranker')
model = AutoModelForSequenceClassification.from_pretrained('upskyy/ko-reranker')
model.eval()

pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():
    # Tokenize all pairs together so they are scored in one batch
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    # The model outputs one logit per pair; higher means more relevant
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
print(scores)
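The scores above are raw logits, so they are only meaningful relative to each other. As a worked example, here is a minimal sketch of the typical reranking loop, reusing the Transformers setup above; the query, candidate passages, and top-k cutoff are illustrative placeholders (in practice the candidates would come from a first-stage retriever such as BM25 or a bi-encoder):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('upskyy/ko-reranker')
model = AutoModelForSequenceClassification.from_pretrained('upskyy/ko-reranker')
model.eval()

query = "판다는 무엇인가요?"  # illustrative query
passages = [  # illustrative first-stage candidates
    "안녕하세요.",
    "자이언트 판다는 중국에 서식하는 곰과의 동물이다.",
    "판다는 주로 대나무를 먹는다.",
]

# Score every (query, passage) pair in one batch
pairs = [[query, p] for p in passages]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1).float()

# Sort candidates by descending relevance and keep the best two
ranked = sorted(zip(passages, scores.tolist()), key=lambda x: x[1], reverse=True)
for passage, score in ranked[:2]:
    print(f"{score:.3f}\t{passage}")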
📚 Documentation
Citation
@misc{bge_embedding,
      title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
      author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
      year={2023},
      eprint={2309.07597},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
📄 License
FlagEmbedding is released under the MIT License. The released models can be used freely for commercial purposes.