bge-reranker-ft open-source model: a text-pair scoring tool for text reranking and semantic search


Bge Reranker Ft

Developed by foochun

A cross-encoder model fine-tuned from BAAI/bge-reranker-base. It scores text pairs and is suited to text reranking and semantic search tasks.

Text Embedding

Safetensors

#text-reranking #name-matching-optimization #short-text-similarity

Downloads: 70

Release date: 5/5/2025

Model Overview

This model was fine-tuned from BAAI/bge-reranker-base and trained with the sentence-transformers library. It computes a similarity score for a pair of texts and is mainly used for text reranking and semantic search.

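Model Features

Efficient text-pair scoring

Quickly computes a relevance score between two texts, which makes it suitable for ranking large sets of candidate texts.

Fine-tuned from BGE-reranker

Fine-tuned from BAAI/bge-reranker-base, so it builds on the reranking ability of the base model.

Multiple-negatives training

Trained with a multiple negatives ranking loss, which pushes the model to score the correct match above several negative examples.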

Model Capabilities

Text similarity scoring

Semantic search

Text reranking

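Use Cases

Information retrieval

Reranking search engine results

Reranks the results returned by a search engine so that the most relevant results move up.

Name matching

Recognizing name variations

Identifies whether names written in different forms refer to the same person, for example 'zach koh yong liang' and 'yong liang koh zach'.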

🚀 Cross Encoder model based on BAAI/bge-reranker-base

This is a Cross Encoder model fine-tuned from BAAI/bge-reranker-base using the sentence-transformers library. It computes scores for text pairs, which can be used for text reranking and semantic search.

📚 Documentation

Model Details

Model Description

Attribute	Details
Model type	Cross Encoder
Base model	BAAI/bge-reranker-base
Maximum sequence length	512 tokens
Number of output labels	1 label

Model Sources

Documentation: Sentence Transformers Documentation
Documentation: Cross Encoder Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Cross Encoders on Hugging Face

💻 Usage Examples

Basic usage

First, install the Sentence Transformers library.

pip install -U sentence-transformers

Then load the model and run inference.

from sentence_transformers import CrossEncoder

# Download from the 🤗 Hub
model = CrossEncoder("foochun/bge-reranker-ft")
# Get scores for pairs of texts
pairs = [
    ['zach koh yong liang', 'yong liang koh zach'],
    ['zulkifli bin mohamad', 'zulkifli bin muhammad'],
    ['rahman bin mohd rashid', 'rahman mohammed rashid'],
    ['mohd syukri bin bakar', 'muhd syukri bakar'],
    ['carmen tan fang kiat', 'tan fang kiat'],
]
scores = model.predict(pairs)
print(scores.shape)
# (5,)

# Or rank different texts based on similarity to a single text
ranks = model.rank(
    'zach koh yong liang',
    [
        'yong liang koh zach',
        'zulkifli bin muhammad',
        'rahman mohammed rashid',
        'muhd syukri bakar',
        'tan fang kiat',
    ]
)
# [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
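For name matching, the ranked output usually has to be turned into a decision. The following is a minimal sketch of one way to do that, assuming `ranks` has the list-of-dicts shape shown above. The helper `best_match`, the threshold of 0.5, and the example scores are all hypothetical and chosen only for illustration; a real threshold would need to be tuned on held-out data.

```python
def best_match(ranks, candidates, threshold=0.5):
    """Return (candidate_text, score) for the top-scoring candidate,
    or None when no candidate reaches the (hypothetical) threshold."""
    if not ranks:
        return None
    # Do not rely on the input order; pick the highest score explicitly.
    top = max(ranks, key=lambda r: r["score"])
    if top["score"] < threshold:
        return None
    return candidates[top["corpus_id"]], float(top["score"])


candidates = [
    "yong liang koh zach",
    "zulkifli bin muhammad",
    "rahman mohammed rashid",
    "muhd syukri bakar",
    "tan fang kiat",
]

# Made-up scores for illustration; real values come from model.rank(...).
example_ranks = [
    {"corpus_id": 0, "score": 0.97},
    {"corpus_id": 4, "score": 0.12},
]

print(best_match(example_ranks, candidates))
```

Returning `None` below the threshold keeps "no confident match" distinct from a weak best guess, which matters when most candidates are unrelated names.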

🔧 Technical Details

Training Dataset

Unnamed Dataset

Size: 72,905 training samples
Columns: query, pos, neg
Approximate statistics based on the first 1000 samples:

| | query | pos | neg |
|:--------|:--------|:--------|:--------|
| Type | string | string | string |
| Min length | 9 characters | 9 characters | 9 characters |
| Mean length | 19.91 characters | 17.64 characters | 17.95 characters |
| Max length | 45 characters | 40 characters | 37 characters |

Samples:

| query | pos | neg |
|:--------|:--------|:--------|
| sim hong soon | sim hong soon | sim soon hong |
| raja mariam binti raja sharif | raja mariam raja sharif | zuraidah binti dollah |
| saw ann fui | fui saw ann | ann saw fui |

Loss function: MultipleNegativesRankingLoss, with these parameters:

{
    "scale": 10.0,
    "num_negatives": 4,
    "activation_fn": "torch.nn.modules.activation.Sigmoid"
}
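To make the parameters above concrete, here is a simplified sketch of how a multiple-negatives ranking objective with `scale` 10.0 and a sigmoid activation could be computed for a single query. It assumes the positive pair sits at index 0 among its negatives, and it uses plain Python instead of the library's batched PyTorch code. The function names are illustrative and not part of sentence-transformers.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def ranking_loss_single_query(pos_logit, neg_logits, scale=10.0):
    """Cross-entropy over [positive, negatives...] with the positive as the target.

    Each raw cross-encoder logit is squashed with a sigmoid and multiplied
    by `scale` before the softmax, mirroring the parameters listed above.
    """
    scaled = [scale * sigmoid(z) for z in [pos_logit, *neg_logits]]
    # log-sum-exp with the max subtracted for numerical stability
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_sum_exp - scaled[0]


# One positive and four negatives (matching "num_negatives": 4).
well_separated = ranking_loss_single_query(6.0, [-6.0, -5.0, -6.0, -4.0])
confused = ranking_loss_single_query(-6.0, [6.0, 5.0, 6.0, 4.0])
print(well_separated, confused)
```

Because the sigmoid bounds every score to the interval (0, 1), the scale factor controls how sharply the softmax can separate the positive from the negatives.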

Evaluation Dataset

Unnamed Dataset

Size: 10,415 evaluation samples
Columns: query, pos, neg
Approximate statistics based on the first 1000 samples:

| | query | pos | neg |
|:--------|:--------|:--------|:--------|
| Type | string | string | string |
| Min length | 9 characters | 9 characters | 8 characters |
| Mean length | 19.95 characters | 17.8 characters | 18.33 characters |
| Max length | 43 characters | 42 characters | 36 characters |

Samples:

| query | pos | neg |
|:--------|:--------|:--------|
| zach koh yong liang | yong liang koh zach | liang yong koh zach |
| zulkifli bin mohamad | zulkifli bin muhammad | razak bin ibrahim |
| rahman bin mohd rashid | rahman mohammed rashid | fauzi bin mohd |

Loss function: MultipleNegativesRankingLoss, with these parameters:

{
    "scale": 10.0,
    "num_negatives": 4,
    "activation_fn": "torch.nn.modules.activation.Sigmoid"
}
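The character-length statistics reported for both datasets can be reproduced along the following lines. This is a small sketch that uses only the three evaluation `query` samples shown above, not the first 1000 rows, so its numbers will differ from the table. The helper name `char_stats` is hypothetical.

```python
from statistics import mean


def char_stats(texts):
    """Return min, mean (rounded to 2 decimals) and max character length."""
    lengths = [len(t) for t in texts]
    return {
        "min": min(lengths),
        "mean": round(mean(lengths), 2),
        "max": max(lengths),
    }


# The three evaluation queries listed in the sample table.
queries = [
    "zach koh yong liang",
    "zulkifli bin mohamad",
    "rahman bin mohd rashid",
]

print(char_stats(queries))
```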

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 64
per_device_eval_batch_size: 64
learning_rate: 1e-05
warmup_ratio: 0.1
seed: 12
fp16: True
dataloader_num_workers: 4
load_best_model_at_end: True
batch_sampler: no_duplicates
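As a rough sketch, the values above could be collected as keyword arguments and passed to the training-arguments class of Sentence Transformers. The original training script is not shown, so this is an assumption about how the settings were supplied. The `output_dir` argument is a hypothetical placeholder, and `fp16: True` generally requires a CUDA device.

```python
# Non-default hyperparameters from the list above, as keyword arguments.
NON_DEFAULT_HYPERPARAMETERS = {
    "eval_strategy": "steps",
    "per_device_train_batch_size": 64,
    "per_device_eval_batch_size": 64,
    "learning_rate": 1e-05,
    "warmup_ratio": 0.1,
    "seed": 12,
    "fp16": True,  # mixed precision; typically needs a GPU
    "dataloader_num_workers": 4,
    "load_best_model_at_end": True,
    "batch_sampler": "no_duplicates",
}


def build_training_args(output_dir):
    """Create training arguments with the settings above.

    `output_dir` is a placeholder path chosen by the caller. The import is
    done lazily so the dictionary can be inspected without the library.
    """
    from sentence_transformers.cross_encoder import CrossEncoderTrainingArguments

    return CrossEncoderTrainingArguments(
        output_dir=output_dir, **NON_DEFAULT_HYPERPARAMETERS
    )
```

All other settings, such as the number of epochs, would stay at the library defaults under this reading.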

Training Logs

Epoch	Step	Training Loss
0.0009	1	0.5117
0.8772	1000	0.0955
1.7544	2000	0.005
2.6316	3000	0.0039
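The epoch fractions in the log are consistent with the dataset size and batch size reported earlier. The check below is a small worked example, assuming a single device, no gradient accumulation, and that the final partial batch is kept; none of these are stated in the source.

```python
import math

train_samples = 72_905  # training set size reported above
batch_size = 64         # per_device_train_batch_size

# Batches needed to see every sample once (the last batch is partial).
steps_per_epoch = math.ceil(train_samples / batch_size)


def epoch_at(step):
    """Fraction of epochs completed after `step` optimizer steps."""
    return round(step / steps_per_epoch, 4)


for step in (1, 1000, 2000, 3000):
    print(step, epoch_at(step))
```

Under these assumptions, the computed fractions match the Epoch column of the log for all four logged steps.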

Framework Versions

Python: 3.11.9
Sentence Transformers: 4.1.0
Transformers: 4.51.3
PyTorch: 2.6.0+cu124
Accelerate: 1.6.0
Datasets: 3.6.0
Tokenizers: 0.21.1

📄 Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}