ko-reranker-8k: an open-source text reranking model fine-tuned on Korean data for high-accuracy relevance ranking

Ko Reranker 8k

Developed by upskyy

A text reranking model fine-tuned on Korean data, based on the BAAI/bge-reranker-v2-m3 model

Tags: Text Embedding, Transformers, Multilingual

Open-source license: Apache-2.0

#Korean reranking #Multilingual support #High-accuracy relevance scoring

Downloads: 14

Release date: August 16, 2024

Model Overview

This is a text reranking model optimized for Korean and multilingual text. Given a query and a text passage, it computes a relevance score for the pair.

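Model Features

Korean optimization

Fine-tuned on Korean data, which makes it particularly well suited to Korean text ranking tasks

Multilingual support

Supports multiple languages in addition to Korean

Efficient computation

Supports FP16 inference for faster computation and higher throughput

Score normalization

Optionally maps relevance scores to the 0-1 range so that they are easier to compare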
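
The precision trade-off behind FP16 can be illustrated without loading the model. The sketch below assumes that FP16 here means IEEE 754 half precision, which Python's `struct` module exposes as format `e`; the helper name `to_fp16` is hypothetical and is not part of FlagEmbedding. It shows that a half-precision value between 8 and 16 can only be a multiple of 2^-7, so nearby scores collapse onto the same value.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Between 8 and 16, half precision only represents multiples of 2**-7 = 0.0078125.
print(to_fp16(8.3829))      # 8.3828125
print(to_fp16(-8.3828125))  # -8.3828125 (already exactly representable)
```

This rounding is one plausible reason why scores computed in half precision can differ slightly from full-precision scores.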

Model Capabilities

Text relevance scoring

Multilingual text processing

Query-passage matching

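Use Cases

Information retrieval

Search engine result ranking

Ranks the results returned by a search engine by relevance to the query

Improves the relevance of search results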
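
Reranking search results by relevance can be sketched as follows. This is a minimal illustration, not the author's pipeline: `rerank` and `overlap_scorer` are hypothetical names, and the word-overlap scorer is only a stand-in so that the example runs without downloading the model.

```python
from typing import Callable, List, Sequence, Tuple

def rerank(
    query: str,
    passages: Sequence[str],
    score_pairs: Callable[[List[List[str]]], List[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every [query, passage] pair and return the top_k passages, best first."""
    pairs = [[query, passage] for passage in passages]
    scores = score_pairs(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Stand-in scorer for illustration only: counts words shared by query and passage.
def overlap_scorer(pairs: List[List[str]]) -> List[float]:
    return [float(len(set(q.lower().split()) & set(p.lower().split()))) for q, p in pairs]

results = rerank(
    "what is panda?",
    ["hi", "The giant panda is a bear species endemic to China.", "nice weather today"],
    overlap_scorer,
    top_k=2,
)
print(results)
```

In real use, a scoring function that accepts a list of [query, passage] pairs and returns one score per pair, such as the reranker's `compute_score` method, would take the place of `overlap_scorer`. Setting `top_k=1` gives the single most relevant candidate, which is the pattern a question answering system would use.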

Question answering systems

Selects the most relevant answer from a set of candidate answers

Improves the accuracy of QA systems

Content recommendation

News recommendation

Recommends the news items most relevant to a user query

Improves the accuracy of content recommendations

🚀 upskyy/ko-reranker-8k

ko-reranker-8k is a model obtained by fine-tuning BAAI/bge-reranker-v2-m3 on Korean data.

🚀 Quick Start

This section explains how to use the upskyy/ko-reranker-8k model.

📦 Installation

Using FlagEmbedding

pip install -U FlagEmbedding

💻 Usage Examples

Basic usage

Using FlagEmbedding

Get relevance scores (a higher score indicates greater relevance).

from FlagEmbedding import FlagReranker

reranker = FlagReranker('upskyy/ko-reranker-8k', use_fp16=True) # Setting use_fp16 to True speeds up computation at the cost of a slight degradation in accuracy

score = reranker.compute_score(['query', 'passage'])
print(score) # -8.3828125

# To map the score into the 0-1 range, set "normalize=True", which applies a sigmoid function to the score
score = reranker.compute_score(['query', 'passage'], normalize=True)
print(score) # 0.000228713314721116

scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores) # [-11.2265625, 8.6875]

# To map the scores into the 0-1 range, set "normalize=True", which applies a sigmoid function to each score
scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']], normalize=True)
print(scores) # [1.3315579521758342e-05, 0.9998313472460109]
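
The effect of `normalize=True` can be reproduced by hand. The sketch below applies the standard sigmoid function to the raw scores printed above. It is a standalone illustration of the formula, not the library's internal code, and the helper name `sigmoid` is chosen here for clarity.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw relevance score (logit) to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

raw_scores = [-11.2265625, 8.6875]  # raw scores from the example above
normalized = [sigmoid(s) for s in raw_scores]
print(normalized)  # approximately [1.33e-05, 0.99983]
```

Because the sigmoid is strictly increasing, normalization does not change the ranking order; it only makes scores easier to compare or to threshold.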

Using Hugging Face Transformers

Get relevance scores (a higher score indicates greater relevance).

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('upskyy/ko-reranker-8k')
model = AutoModelForSequenceClassification.from_pretrained('upskyy/ko-reranker-8k')
model.eval()

pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
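with torch.no_grad():
    # Tokenize each [query, passage] pair as one sequence; inputs longer than max_length are truncated
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    # The model returns one logit per pair; flatten to a 1-D tensor of raw relevance scores
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)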

📚 Documentation

Citation

@misc{li2023making,
      title={Making Large Language Models A Better Foundation For Dense Retrieval}, 
      author={Chaofan Li and Zheng Liu and Shitao Xiao and Yingxia Shao},
      year={2023},
      eprint={2312.15503},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{chen2024bge,
      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation}, 
      author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
      year={2024},
      eprint={2402.03216},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}