reranker-msmarco-ModernBERT-base-lambdaloss: an open-source model for text reranking and semantic search

Reranker Msmarco ModernBERT Base Lambdaloss

Developed by tomaarsen

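This is a cross-encoder model fine-tuned from ModernBERT-base. It computes a score for each pair of texts and is suited to text reranking and semantic search.

Text Embedding

Safetensors

English

Open Source License: Apache-2.0

#Text Reranking #Semantic Search #High-Precision Scoring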

Downloads 89

Release Time : 3/17/2025

Model Overview

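The model is based on the ModernBERT-base architecture and was trained on the msmarco dataset with the sentence-transformers library. It computes a score for a pair of texts and can be applied to information retrieval, question answering, and similar scenarios.

Model Features

Text reranking

Computes a score for each text pair, which a retrieval system can use to reorder its candidates and improve ranking quality

Long sequence support

Supports sequences of up to 8192 tokens, which suits long texts

Strong benchmark results

Performs well on the reported evaluation sets, for example an ndcg@10 of 0.7251 on NanoMSMARCO_R100

Model Capabilities

Text pair scoring

Reranking of information retrieval results

Answer ranking for question answering systems

Semantic search

Use Cases

Information retrieval

Search engine result reranking

Reorders the results returned by a search engine in a second pass so that relevant documents rank higher

map of 0.6768 on NanoMSMARCO_R100

Question answering

Answer relevance ranking

Scores candidate answers for relevance and selects the most relevant one

mrr@10 of 0.7402 on NanoNQ_R100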
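
The two use cases above share one pattern: a first stage proposes candidates, and a pair scorer reorders them. The following is a minimal sketch of that pattern, not the model's own code. The names `rerank`, `PairScorer`, and `toy_overlap_scorer` are hypothetical, and the toy scorer only counts shared words so that the example runs without downloading anything.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: any function mapping (query, document) pairs to one score per pair.
PairScorer = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]


def rerank(query: str, candidates: List[str], score_pairs: PairScorer,
           top_k: int = 3) -> List[Tuple[str, float]]:
    """Re-order first-stage candidates by pair score, highest first."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_k]]


def toy_overlap_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer (NOT the model): counts query words that also occur in the document."""
    scores = []
    for query, doc in pairs:
        query_words = set(query.lower().split())
        doc_words = set(doc.lower().split())
        scores.append(float(len(query_words & doc_words)))
    return scores


candidates = [
    "egg whites are low in calories",
    "the weather is nice",
    "how many calories in an egg yolk",
]
top = rerank("how many calories in an egg", candidates, toy_overlap_scorer, top_k=2)
print(top)
```

With the real model, the `predict` method of a loaded `CrossEncoder` could be passed in place of the toy scorer, since it also takes a list of text pairs and returns one score per pair.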

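🚀 Cross-encoder based on answerdotai/ModernBERT-base

This model is a cross-encoder based on answerdotai/ModernBERT-base, fine-tuned on the msmarco dataset with the sentence-transformers library. It computes scores for text pairs and can be used for text reranking and semantic search.

🚀 Quick Start

Direct usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load the model and run inference.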

from sentence_transformers import CrossEncoder

# Download the model from the 🤗 Hub
model = CrossEncoder("tomaarsen/reranker-msmarco-ModernBERT-base-lambdaloss")
# Get scores for pairs of texts
pairs = [
    ['How many calories in an egg', 'There are on average between 55 and 80 calories in an egg depending on its size.'],
    ['How many calories in an egg', 'Egg whites are very low in calories, have no fat, no cholesterol, and are loaded with protein.'],
    ['How many calories in an egg', 'Most of the calories in an egg come from the yellow yolk in the center.'],
]
scores = model.predict(pairs)
print(scores.shape)
# (3,)

# Or rank several texts by their score against a single query text
ranks = model.rank(
    'How many calories in an egg',
    [
        'There are on average between 55 and 80 calories in an egg depending on its size.',
        'Egg whites are very low in calories, have no fat, no cholesterol, and are loaded with protein.',
        'Most of the calories in an egg come from the yellow yolk in the center.',
    ]
)
# [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
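
To make the shape of the `rank` result concrete, here is a small sketch of how a list of per-document scores can be turned into that kind of list. It assumes that `corpus_id` is the index of the document in the input list and that results are sorted by descending score. The helper `rank_from_scores` is hypothetical and the scores are made up, not model output.

```python
from typing import Dict, List, Sequence, Union


def rank_from_scores(scores: Sequence[float]) -> List[Dict[str, Union[int, float]]]:
    """Sort document indices by descending score, returning one dict per document."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [{"corpus_id": i, "score": float(scores[i])} for i in order]


documents = ["doc A", "doc B", "doc C"]
# Made-up scores for illustration only (not produced by the model).
example_scores = [2.5, -1.0, 0.75]

ranking = rank_from_scores(example_scores)
for entry in ranking:
    # corpus_id indexes back into the original document list.
    print(documents[entry["corpus_id"]], entry["score"])
```

Keeping the index rather than the text in each entry means the caller can map results back to whatever records the documents came from.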

✨ Main Features

Fine-tuned from the answerdotai/ModernBERT-base model.
Computes scores for text pairs, usable for text reranking and semantic search.
Supports input sequences of up to 8192 tokens.


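📚 Detailed Documentation

Model Details

Model Description

Property	Details
Model type	Cross-encoder
Base model	answerdotai/ModernBERT-base
Maximum sequence length	8192 tokens
Number of output labels	1 label
Training dataset	msmarco
Language	English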

Evaluation

Metrics

Cross-Encoder Reranking

Datasets: NanoMSMARCO_R100, NanoNFCorpus_R100, and NanoNQ_R100

Evaluated with CrossEncoderRerankingEvaluator using the following parameters:

{
    "at_k": 10,
    "always_rerank_positives": true
}

Metric	NanoMSMARCO_R100	NanoNFCorpus_R100	NanoNQ_R100
map	0.6768 (+0.1872)	0.3576 (+0.0966)	0.7134 (+0.2938)
mrr@10	0.6690 (+0.1915)	0.5819 (+0.0820)	0.7402 (+0.3135)
ndcg@10	0.7251 (+0.1847)	0.4143 (+0.0892)	0.7594 (+0.2587)
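
For readers unfamiliar with the three metrics in the table, the sketch below computes each of them for a single query with binary relevance labels. These are the textbook definitions. The evaluator in sentence-transformers may differ in details such as tie handling or how missing positives are counted, so treat this as an illustration rather than its implementation. The label list is hypothetical.

```python
import math
from typing import List


def mrr_at_k(relevance: List[int], k: int = 10) -> float:
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for position, rel in enumerate(relevance[:k], start=1):
        if rel > 0:
            return 1.0 / position
    return 0.0


def average_precision(relevance: List[int]) -> float:
    """Mean of precision@i over the positions i that hold a relevant item."""
    hits = 0
    precision_sum = 0.0
    for position, rel in enumerate(relevance, start=1):
        if rel > 0:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def ndcg_at_k(relevance: List[int], k: int = 10) -> float:
    """DCG of the given order divided by the DCG of the ideal order."""
    def dcg(values: List[int]) -> float:
        return sum(v / math.log2(i + 1) for i, v in enumerate(values[:k], start=1))

    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0


# Hypothetical reranked list: 1 marks a relevant document, 0 an irrelevant one.
labels_in_ranked_order = [0, 1, 0, 1]
print(mrr_at_k(labels_in_ranked_order))
print(average_precision(labels_in_ranked_order))
print(ndcg_at_k(labels_in_ranked_order))
```

Averaging `average_precision` over all queries gives map, and the same averaging applies to mrr@10 and ndcg@10.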

Cross-Encoder Nano BEIR

Dataset: NanoBEIR_R100_mean

Evaluated with CrossEncoderNanoBEIREvaluator using the following parameters:

{
    "dataset_names": [
        "msmarco",
        "nfcorpus",
        "nq"
    ],
    "rerank_k": 100,
    "at_k": 10,
    "always_rerank_positives": true
}

Metric	Value
map	0.5826 (+0.1925)
mrr@10	0.6637 (+0.1957)
ndcg@10	0.6329 (+0.1776)
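
The NanoBEIR_R100_mean values are consistent with an unweighted mean of the three per-dataset results in the reranking table above. The short check below reproduces that arithmetic. Reading the name as a plain average is an assumption, but the numbers agree to four decimal places.

```python
# Per-dataset values copied from the reranking table above, in the order
# NanoMSMARCO_R100, NanoNFCorpus_R100, NanoNQ_R100.
per_dataset = {
    "map": [0.6768, 0.3576, 0.7134],
    "mrr@10": [0.6690, 0.5819, 0.7402],
    "ndcg@10": [0.7251, 0.4143, 0.7594],
}

# Unweighted mean over the three datasets.
means = {metric: sum(values) / len(values) for metric, values in per_dataset.items()}

for metric, mean in means.items():
    print(f"{metric}: {mean:.4f}")
```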

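Training Details

Training Dataset

Dataset: msmarco (version: a0537b6)
Size: 399,282 training samples
Columns: query_id, doc_ids, and labels

Evaluation Dataset

Dataset: msmarco (version: a0537b6)
Size: 1,000 evaluation samples
Columns: query_id, doc_ids, and labels

Training Hyperparameters

Non-default hyperparameters: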
- eval_strategy: steps
- num_train_epochs: 1
- warmup_ratio: 0.1
- seed: 12
- bf16: True
- load_best_model_at_end: True

Framework Versions

Python: 3.11.10
Sentence Transformers: 3.5.0.dev0
Transformers: 4.49.0
PyTorch: 2.5.1+cu124
Accelerate: 1.2.0
Datasets: 2.21.0
Tokenizers: 0.21.0

🔧 Technical Details

Loss Function

Trained with the LambdaLoss loss function using the following parameters:

{
    "weighting_scheme": "sentence_transformers.cross_encoder.losses.LambdaLoss.NDCGLoss2PPScheme",
    "k": null,
    "sigma": 1.0,
    "eps": 1e-10,
    "reduction_log": "binary",
    "activation_fct": "torch.nn.modules.linear.Identity",
    "mini_batch_size": 8
}
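
To give a feel for what a loss in the LambdaLoss family optimizes, here is a pure-Python sketch of the simpler LambdaRank-style weighting described in the LambdaLoss paper. Every pair in which one document is more relevant than the other contributes a logistic loss weighted by how much swapping the two would change NDCG. This is not the NDCGLoss2PPScheme named in the configuration and not the library's tensor implementation. The function name is hypothetical, and reading `"reduction_log": "binary"` as a base-2 logarithm is an assumption.

```python
import math
from typing import List


def lambdarank_style_loss(scores: List[float], labels: List[int], sigma: float = 1.0) -> float:
    """Pairwise logistic loss where each pair is weighted by its |delta NDCG|."""
    n = len(scores)
    # 1-based rank of each document under the current scores, highest score first.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    rank = {doc: position for position, doc in enumerate(order, start=1)}

    # Gains normalised by the ideal DCG, and the positional discount of each document.
    ideal = sorted(labels, reverse=True)
    max_dcg = sum((2 ** y - 1) / math.log2(pos + 1) for pos, y in enumerate(ideal, start=1))
    if max_dcg == 0:
        return 0.0  # no relevant document, so there is nothing to rank
    gain = [(2 ** y - 1) / max_dcg for y in labels]
    discount = [math.log2(rank[i] + 1) for i in range(n)]

    loss = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:
                delta_ndcg = abs(gain[i] - gain[j]) * abs(1 / discount[i] - 1 / discount[j])
                # log2(1 + exp(-sigma * (s_i - s_j))) is small when i is scored above j.
                loss += delta_ndcg * math.log2(1 + math.exp(-sigma * (scores[i] - scores[j])))
    return loss


labels = [1, 0, 0]  # hypothetical relevance labels for three documents
good = lambdarank_style_loss([2.0, 0.0, -1.0], labels)  # relevant document scored highest
bad = lambdarank_style_loss([-1.0, 0.0, 2.0], labels)   # relevant document scored lowest
print(good, bad)
```

The loss is lower when the relevant document is scored above the irrelevant ones, which is the behavior a ranking-metric-driven loss is meant to reward.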

Training Logs

Epoch	Step	Training Loss	Validation Loss	NanoMSMARCO_R100_ndcg@10	NanoNFCorpus_R100_ndcg@10	NanoNQ_R100_ndcg@10	NanoBEIR_R100_mean_ndcg@10
-1	-1	-	-	0.0234 (-0.5170)	0.3412 (+0.0161)	0.0321 (-0.4686)	0.1322 (-0.3231)
0.0000	1	0.8349	-	-	-	-	-
0.0040	200	0.8417	-	-	-	-	-
...	...	...	...	...	...	...	...
0.8014	40000	0.1381	0.1289	0.7251 (+0.1847)	0.4143 (+0.0892)	0.7594 (+0.2587)	0.6329 (+0.1776)
...	...	...	...	...	...	...	...

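In the original table, the bold row marks the saved checkpoint; here that is the step 40000 row, whose metrics match the evaluation results reported above.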

📄 License

This model is released under the apache-2.0 license.

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

LambdaLoss

@inproceedings{wang2018lambdaloss,
  title={The lambdaloss framework for ranking metric optimization},
  author={Wang, Xuanhui and Li, Cheng and Golbandi, Nadav and Bendersky, Michael and Najork, Marc},
  booktitle={Proceedings of the 27th ACM international conference on information and knowledge management},
  pages={1313--1322},
  year={2018}
}