reranker-msmarco-ModernBERT-base-lambdaloss: Open-Source Model for Text Reranking and Semantic Search

Reranker Msmarco ModernBERT Base Lambdaloss

Developed by tomaarsen

This is a cross-encoder model fine-tuned from ModernBERT-base. It computes scores for text pairs and is suited to text reranking and semantic search tasks.

Text Embedding

Safetensors

Language: English

Open-source license: Apache-2.0

Tags: #text-reranking #semantic-search #high-precision-scoring

Downloads: 89

Release date: 3/17/2025

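Model Overview

The model is based on the ModernBERT-base architecture and was trained on the msmarco dataset with the sentence-transformers library. It is designed specifically to compute similarity scores for text pairs and can be applied to scenarios such as information retrieval and question answering systems.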

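Model Features

Efficient text reranking

Computes similarity scores for text pairs quickly, which can improve the ranking quality of a retrieval system.

Long sequence support

Supports sequences of up to 8192 tokens, making it suitable for long texts.

Strong benchmark results

Performs well on several evaluation datasets; for example, it reaches an ndcg@10 of 0.7251 on NanoMSMARCO_R100.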

Model Capabilities

Text similarity scoring

Reranking of information retrieval results

Answer ranking for question answering systems

Semantic search

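Use Cases

Information retrieval

Reranking search engine results

Re-ranks the results returned by a search engine in a second stage so that relevant documents move up.

Reaches a map of 0.6768 on NanoMSMARCO_R100.

Question answering systems

Answer relevance ranking

Scores candidate answers for relevance and selects the most relevant one.

Reaches an mrr@10 of 0.7402 on NanoNQ_R100.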

🚀 Cross-Encoder based on answerdotai/ModernBERT-base

This model is a cross-encoder based on answerdotai/ModernBERT-base, fine-tuned on the msmarco dataset with the sentence-transformers library. It computes scores for text pairs, which can be used for text reranking and semantic search.

🚀 Quick Start

Direct usage with Sentence Transformers

First, install the Sentence Transformers library.

pip install -U sentence-transformers

Then load the model and run inference.

from sentence_transformers import CrossEncoder

# Download the model from the 🤗 Hub
model = CrossEncoder("tomaarsen/reranker-msmarco-ModernBERT-base-lambdaloss")
# Get scores for pairs of texts
pairs = [
    ['How many calories in an egg', 'There are on average between 55 and 80 calories in an egg depending on its size.'],
    ['How many calories in an egg', 'Egg whites are very low in calories, have no fat, no cholesterol, and are loaded with protein.'],
    ['How many calories in an egg', 'Most of the calories in an egg come from the yellow yolk in the center.'],
]
scores = model.predict(pairs)
print(scores.shape)
# (3,)

# Or rank different texts based on their similarity to a single text
ranks = model.rank(
    'How many calories in an egg',
    [
        'There are on average between 55 and 80 calories in an egg depending on its size.',
        'Egg whites are very low in calories, have no fat, no cholesterol, and are loaded with protein.',
        'Most of the calories in an egg come from the yellow yolk in the center.',
    ]
)
# [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
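
The shape of the `rank` output above (a list of `corpus_id`/`score` entries sorted by descending score) can be illustrated without downloading the model. The following is only a sketch: `toy_score` is a hypothetical word-overlap scorer standing in for the cross-encoder, and `rank_documents` is an illustrative helper, not part of the Sentence Transformers API.

```python
def toy_score(query: str, document: str) -> float:
    """Hypothetical stand-in scorer: fraction of query words found in the document."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    doc_words = set(document.lower().replace(".", " ").replace(",", " ").split())
    return len(query_words & doc_words) / len(query_words)


def rank_documents(query, documents, top_k=None):
    """Score every (query, document) pair and sort by descending score."""
    scored = [
        {"corpus_id": idx, "score": toy_score(query, doc)}
        for idx, doc in enumerate(documents)
    ]
    # sorted() is stable, so ties keep their original corpus order.
    scored = sorted(scored, key=lambda item: item["score"], reverse=True)
    return scored if top_k is None else scored[:top_k]


query = "How many calories in an egg"
documents = [
    "There are on average between 55 and 80 calories in an egg depending on its size.",
    "Egg whites are very low in calories, have no fat, no cholesterol, and are loaded with protein.",
    "Most of the calories in an egg come from the yellow yolk in the center.",
]

ranks = rank_documents(query, documents)
for item in ranks:
    print(item["corpus_id"], round(item["score"], 3))
```

In a real pipeline the scorer would be the cross-encoder itself, applied to the top candidates returned by a cheaper first-stage retriever.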

✨ Key Features

- Fine-tuned from the answerdotai/ModernBERT-base model.
- Computes scores for text pairs, which can be used for text reranking and semantic search.
- Supports input sequences of up to 8192 tokens.

📚 Documentation

Model Details

Model Description

Attribute	Details
Model type	Cross-encoder
Base model	answerdotai/ModernBERT-base
Maximum sequence length	8192 tokens
Number of output labels	1
Training dataset	msmarco
Language	English

Evaluation

Metrics

Cross-Encoder Reranking

Datasets: NanoMSMARCO_R100, NanoNFCorpus_R100, NanoNQ_R100
Evaluated with CrossEncoderRerankingEvaluator using the following parameters:
```
{
    "at_k": 10,
    "always_rerank_positives": true
}
```

Metric	NanoMSMARCO_R100	NanoNFCorpus_R100	NanoNQ_R100
map	0.6768 (+0.1872)	0.3576 (+0.0966)	0.7134 (+0.2938)
mrr@10	0.6690 (+0.1915)	0.5819 (+0.0820)	0.7402 (+0.3135)
ndcg@10	0.7251 (+0.1847)	0.4143 (+0.0892)	0.7594 (+0.2587)
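
For readers unfamiliar with the three metrics in this table, here is a minimal sketch of how each is commonly computed for a single query, given the relevance labels of the documents in ranked order. The function names are illustrative, and the evaluator's own implementation may differ in details such as gain definition or tie handling. The table values are averages of such per-query scores.

```python
import math


def average_precision(relevance):
    """AP for one query: mean of precision@rank over the ranks of relevant documents."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank_at_k(relevance, k=10):
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(relevance, k=10):
    """NDCG@k with linear gain and a log2 position discount."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))

    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal else 0.0


# Relevance labels of the reranked documents for one hypothetical query.
example = [0, 1, 0, 1]
print(average_precision(example))
print(reciprocal_rank_at_k(example))
print(ndcg_at_k(example))
```

Averaging `average_precision` over queries gives map, and averaging `reciprocal_rank_at_k` gives mrr@10.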

Cross-Encoder Nano BEIR

Dataset: NanoBEIR_R100_mean

Evaluated with CrossEncoderNanoBEIREvaluator using the following parameters:

{
    "dataset_names": [
        "msmarco",
        "nfcorpus",
        "nq"
    ],
    "rerank_k": 100,
    "at_k": 10,
    "always_rerank_positives": true
}

Metric	Value
map	0.5826 (+0.1925)
mrr@10	0.6637 (+0.1957)
ndcg@10	0.6329 (+0.1776)
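
The NanoBEIR_R100_mean values appear to be the arithmetic mean of the three per-dataset scores in the reranking table. A short check of that reading (the dictionary layout is only for illustration):

```python
# Per-dataset scores in the order NanoMSMARCO_R100, NanoNFCorpus_R100, NanoNQ_R100.
per_dataset = {
    "map": [0.6768, 0.3576, 0.7134],
    "mrr@10": [0.6690, 0.5819, 0.7402],
    "ndcg@10": [0.7251, 0.4143, 0.7594],
}
reported_mean = {"map": 0.5826, "mrr@10": 0.6637, "ndcg@10": 0.6329}

computed_mean = {
    metric: sum(values) / len(values) for metric, values in per_dataset.items()
}

for metric, value in computed_mean.items():
    print(f"{metric}: computed {value:.4f}, reported {reported_mean[metric]:.4f}")
```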

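Training Details

Training Dataset

Dataset: msmarco (version: a0537b6)
Size: 399,282 training samples
Columns: query_id, doc_ids, labels

Evaluation Dataset

Dataset: msmarco (version: a0537b6)
Size: 1,000 evaluation samples
Columns: query_id, doc_ids, labels

Training Hyperparameters

Non-default hyperparameters: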
- eval_strategy: steps
- num_train_epochs: 1
- warmup_ratio: 0.1
- seed: 12
- bf16: True
- load_best_model_at_end: True
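
As a rough worked example, the listed `num_train_epochs` and `warmup_ratio` can be translated into optimizer steps. The batch size is not stated here; the sketch assumes 8 samples per step on a single device with no gradient accumulation, which is consistent with the epoch and step columns of the training log (step 40000 at epoch 0.8014). Rounding the warmup length up is also an assumption.

```python
import math

train_samples = 399_282   # training dataset size listed above
num_train_epochs = 1      # listed hyperparameter
warmup_ratio = 0.1        # listed hyperparameter
batch_size = 8            # ASSUMPTION: not listed among the hyperparameters

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(total_steps)                          # 49911
print(warmup_steps)                         # 4992
print(round(40_000 / steps_per_epoch, 4))   # 0.8014, the epoch logged at step 40000
```

Under these assumptions the learning rate would ramp up over roughly the first 5,000 steps and then decay for the rest of the single epoch.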

Framework Versions

Python: 3.11.10
Sentence Transformers: 3.5.0.dev0
Transformers: 4.49.0
PyTorch: 2.5.1+cu124
Accelerate: 1.2.0
Datasets: 2.21.0
Tokenizers: 0.21.0

🔧 Technical Details

Loss Function

Trained with the LambdaLoss loss function using the following parameters:

{
    "weighting_scheme": "sentence_transformers.cross_encoder.losses.LambdaLoss.NDCGLoss2PPScheme",
    "k": null,
    "sigma": 1.0,
    "eps": 1e-10,
    "reduction_log": "binary",
    "activation_fct": "torch.nn.modules.linear.Identity",
    "mini_batch_size": 8
}
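
To make these parameters concrete, here is a simplified, single-query sketch of a LambdaLoss objective with an NDCG-Loss2++ style pair weighting, following the formulation in the LambdaLoss paper cited below. It is not the sentence-transformers implementation: the function name is illustrative, and `mu` (the weight of the NDCG-Loss2 term) does not appear in the configuration above, so its value of 10.0 is an assumption. `sigma` and `eps` mirror the configuration, `math.log2` corresponds to `"reduction_log": "binary"`, and `"k": null` is read as no truncation of the list.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def discount(rank):
    """Position discount D_k = log2(1 + k) for a 1-indexed rank k."""
    return math.log2(1.0 + rank)


def lambda_loss_ndcg2pp(scores, labels, sigma=1.0, eps=1e-10, mu=10.0):
    """Simplified LambdaLoss for one query (mu is an assumed value)."""
    # Sort documents by predicted score, highest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    s = [scores[i] for i in order]
    y = [labels[i] for i in order]

    # Gains are normalised by the DCG of the ideal ordering.
    ideal = sorted(y, reverse=True)
    max_dcg = sum((2.0 ** rel - 1.0) / discount(r + 1) for r, rel in enumerate(ideal))
    if max_dcg == 0.0:
        return 0.0
    gains = [(2.0 ** rel - 1.0) / max_dcg for rel in y]

    loss = 0.0
    for i in range(len(s)):
        for j in range(len(s)):
            if y[i] <= y[j]:
                continue  # only pairs where document i is more relevant than j
            # LambdaRank term: discount difference between the two positions.
            rho = abs(1.0 / discount(i + 1) - 1.0 / discount(j + 1))
            # NDCG-Loss2 term: depends only on the distance between positions.
            dist = abs(i - j)
            delta = abs(1.0 / discount(dist) - 1.0 / discount(dist + 1))
            weight = (rho + mu * delta) * abs(gains[i] - gains[j])
            prob = max(sigmoid(sigma * (s[i] - s[j])), eps)
            loss -= weight * math.log2(prob)
    return loss


labels = [1, 0, 0]
good = lambda_loss_ndcg2pp([3.0, 2.0, 1.0], labels)  # relevant document ranked first
bad = lambda_loss_ndcg2pp([1.0, 2.0, 3.0], labels)   # relevant document ranked last
print(good, bad)
```

The loss is lower when the relevant document is scored above the irrelevant ones, which is the behaviour the weighting scheme is meant to reward.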

Training Logs

Epoch	Step	Training Loss	Validation Loss	NanoMSMARCO_R100_ndcg@10	NanoNFCorpus_R100_ndcg@10	NanoNQ_R100_ndcg@10	NanoBEIR_R100_mean_ndcg@10
-1	-1	-	-	0.0234 (-0.5170)	0.3412 (+0.0161)	0.0321 (-0.4686)	0.1322 (-0.3231)
0.0000	1	0.8349	-	-	-	-	-
0.0040	200	0.8417	-	-	-	-	-
...	...	...	...	...	...	...	...
0.8014	40000	0.1381	0.1289	0.7251 (+0.1847)	0.4143 (+0.0892)	0.7594 (+0.2587)	0.6329 (+0.1776)
...	...	...	...	...	...	...	...

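Bold rows denote the saved checkpoint. The bold formatting is not preserved here; the step 40000 row matches the evaluation results reported above.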

📄 License

This model is released under the Apache-2.0 license.

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

LambdaLoss

@inproceedings{wang2018lambdaloss,
  title={The lambdaloss framework for ranking metric optimization},
  author={Wang, Xuanhui and Li, Cheng and Golbandi, Nadav and Bendersky, Michael and Najork, Marc},
  booktitle={Proceedings of the 27th ACM international conference on information and knowledge management},
  pages={1313--1322},
  year={2018}
}