reranker-msmarco-ModernBERT-base-lambdaloss: an open-source model for text reranking and semantic search

Reranker Msmarco ModernBERT Base Lambdaloss

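Developed by tomaarsen

A cross-encoder model fine-tuned from ModernBERT-base. It computes a score for each pair of texts and is intended for text reranking and semantic search.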

Text Embedding

Safetensors

Language: English · License: Apache-2.0 · #text-reranking #semantic-search #high-precision-scoring

Downloads: 89

Released: March 17, 2025

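Model Overview

The model builds on ModernBERT-base and was trained on the msmarco dataset with the sentence-transformers library. It outputs a single score for each text pair, which makes it suited to information retrieval, question answering, and similar applications.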

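Model Features

Efficient text reranking

Scores text pairs directly, so the candidates returned by a retrieval system can be reordered by relevance.

Long sequence support

Accepts inputs of up to 8192 tokens, which suits long documents.

Benchmark results

Reaches an ndcg@10 of 0.7251 on NanoMSMARCO_R100, 0.4143 on NanoNFCorpus_R100, and 0.7594 on NanoNQ_R100.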

Model Capabilities

Text-pair scoring

Reranking of retrieval results

Answer ranking for question answering

Semantic search

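Use Cases

Information retrieval

Search result reranking

Reorders the results returned by a search engine so that relevant documents rank higher.

Reaches a map of 0.6768 on NanoMSMARCO_R100.

Question answering

Answer relevance ranking

Scores candidate answers for relevance to the question and selects the most relevant one.

Reaches an mrr@10 of 0.7402 on NanoNQ_R100.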

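🚀 Cross-Encoder Based on answerdotai/ModernBERT-base

This model is a cross-encoder fine-tuned from answerdotai/ModernBERT-base on the msmarco dataset using the sentence-transformers library. It computes scores for pairs of texts, which can be used for text reranking and semantic search.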

🚀 Quick Start

Direct usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then load the model and run inference.

from sentence_transformers import CrossEncoder

# Download the model from the 🤗 Hub
model = CrossEncoder("tomaarsen/reranker-msmarco-ModernBERT-base-lambdaloss")
# Score each (query, passage) pair
pairs = [
    ['How many calories in an egg', 'There are on average between 55 and 80 calories in an egg depending on its size.'],
    ['How many calories in an egg', 'Egg whites are very low in calories, have no fat, no cholesterol, and are loaded with protein.'],
    ['How many calories in an egg', 'Most of the calories in an egg come from the yellow yolk in the center.'],
]
scores = model.predict(pairs)
print(scores.shape)
# (3,)

# Or rank several texts by their score against a single query
ranks = model.rank(
    'How many calories in an egg',
    [
        'There are on average between 55 and 80 calories in an egg depending on its size.',
        'Egg whites are very low in calories, have no fat, no cholesterol, and are loaded with protein.',
        'Most of the calories in an egg come from the yellow yolk in the center.',
    ]
)
# [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
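
The list returned by `rank` pairs each input document's index with its score, ordered from highest to lowest score. As a minimal sketch of that ordering step, the helper below builds the same shape of result from a plain list of scores. `rank_from_scores` is a hypothetical name, and the scores are made-up numbers, not model output.

```python
def rank_from_scores(scores):
    """Sort document indices by score, highest first.

    Mirrors the shape of the list shown above: one dict per document with
    'corpus_id' (index into the input list) and 'score'.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [{"corpus_id": i, "score": float(scores[i])} for i in order]


# Illustrative scores only; a real run would use model.predict(pairs)
example_scores = [2.5, -1.0, 0.7]
example_ranks = rank_from_scores(example_scores)
print(example_ranks)
```

The `corpus_id` values can then be used to index back into the original list of passages.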

✨ Key Features

Fine-tuned from the answerdotai/ModernBERT-base model.
Computes scores for text pairs, for use in text reranking and semantic search.
Accepts input sequences of up to 8192 tokens.

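📚 Documentation

Model Details

Model Description

Property	Details
Model type	Cross-encoder
Base model	answerdotai/ModernBERT-base
Maximum sequence length	8192 tokens
Number of output labels	1
Training dataset	msmarco
Language	English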

Evaluation

Metrics

Cross-encoder reranking

Datasets: NanoMSMARCO_R100, NanoNFCorpus_R100, and NanoNQ_R100

Evaluated with CrossEncoderRerankingEvaluator using these parameters:

{
    "at_k": 10,
    "always_rerank_positives": true
}
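
As a rough sketch of what `always_rerank_positives` controls, assuming it behaves as the sentence-transformers documentation describes: with `true`, every known positive is scored even if the first-stage retriever missed it; with `false`, only positives already among the retrieved documents count. The function name and sample data below are hypothetical.

```python
def build_candidates(retrieved, positives, always_rerank_positives=True):
    """Return (documents, relevance_labels) to hand to the reranker."""
    positive_set = set(positives)
    if always_rerank_positives:
        # All positives are included, followed by the retrieved non-positives
        docs = list(positives) + [d for d in retrieved if d not in positive_set]
        labels = [1] * len(positives) + [0] * (len(docs) - len(positives))
    else:
        # Only what the first stage returned is reranked
        docs = list(retrieved)
        labels = [int(d in positive_set) for d in docs]
    return docs, labels


retrieved = ["d3", "d7", "d1"]   # hypothetical first-stage results
positives = ["d1", "d9"]         # "d9" was missed by the first stage

docs_all, labels_all = build_candidates(retrieved, positives, True)
docs_seen, labels_seen = build_candidates(retrieved, positives, False)
print(docs_all, labels_all)
print(docs_seen, labels_seen)
```

With the flag enabled, the scores reflect the reranker largely in isolation; with it disabled, they would also depend on how many positives the first stage managed to retrieve.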

Metric	NanoMSMARCO_R100	NanoNFCorpus_R100	NanoNQ_R100
map	0.6768 (+0.1872)	0.3576 (+0.0966)	0.7134 (+0.2938)
mrr@10	0.6690 (+0.1915)	0.5819 (+0.0820)	0.7402 (+0.3135)
ndcg@10	0.7251 (+0.1847)	0.4143 (+0.0892)	0.7594 (+0.2587)
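
For readers unfamiliar with the three metrics, here is a minimal pure-Python sketch of how each is commonly defined for a single query, given binary relevance labels listed in ranked order. It illustrates the textbook definitions and is not the evaluator's own code; the function names are chosen for this example, and reported numbers are averages over all queries in a dataset.

```python
import math


def average_precision(labels):
    """Mean of precision@i taken at each rank i that holds a relevant item."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank_at_k(labels, k=10):
    """1 / rank of the first relevant item within the top k, else 0."""
    for rank, rel in enumerate(labels[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(labels, k=10):
    """DCG of the top k divided by the DCG of an ideal ordering."""
    def dcg(rels):
        return sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))
    ideal = dcg(sorted(labels, reverse=True)[:k])
    return dcg(labels[:k]) / ideal if ideal else 0.0


# One query whose relevant documents landed at ranks 2 and 4
ranked_labels = [0, 1, 0, 1]
print(average_precision(ranked_labels))
print(reciprocal_rank_at_k(ranked_labels))
print(ndcg_at_k(ranked_labels))
```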

Cross-encoder Nano BEIR

Dataset: NanoBEIR_R100_mean

Evaluated with CrossEncoderNanoBEIREvaluator using these parameters:

{
    "dataset_names": [
        "msmarco",
        "nfcorpus",
        "nq"
    ],
    "rerank_k": 100,
    "at_k": 10,
    "always_rerank_positives": true
}

Metric	Value
map	0.5826 (+0.1925)
mrr@10	0.6637 (+0.1957)
ndcg@10	0.6329 (+0.1776)
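
The NanoBEIR_R100_mean values are consistent with a plain average of the three per-dataset scores in the reranking table. A quick check, with the values copied from the two tables:

```python
per_dataset = {
    "map": [0.6768, 0.3576, 0.7134],
    "mrr@10": [0.6690, 0.5819, 0.7402],
    "ndcg@10": [0.7251, 0.4143, 0.7594],
}
reported_mean = {"map": 0.5826, "mrr@10": 0.6637, "ndcg@10": 0.6329}

for metric, values in per_dataset.items():
    mean = sum(values) / len(values)
    print(f"{metric}: computed {mean:.4f}, reported {reported_mean[metric]:.4f}")
```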

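Training Details

Training Dataset

Dataset: msmarco (revision a0537b6)
Size: 399,282 training samples
Columns: query_id, doc_ids, and labels

Evaluation Dataset

Dataset: msmarco (revision a0537b6)
Size: 1,000 evaluation samples
Columns: query_id, doc_ids, and labels

Training Hyperparameters

Non-default hyperparameters: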
- eval_strategy: steps
- num_train_epochs: 1
- warmup_ratio: 0.1
- seed: 12
- bf16: True
- load_best_model_at_end: True
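
The hyperparameters and dataset size above are enough to estimate the training schedule. The sketch below assumes an effective batch size of 8 queries per step, which the source does not state; it is the usual default and agrees with the training log, where step 40000 falls at epoch 0.8014. The round-up used for the warmup step count is likewise an assumption.

```python
import math

num_samples = 399_282      # training samples, from the dataset section
num_train_epochs = 1
warmup_ratio = 0.1
batch_size = 8             # ASSUMPTION: not listed in the source

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)  # ASSUMPTION: rounded up

print(steps_per_epoch)                      # 49911
print(warmup_steps)                         # 4992
print(round(40_000 / steps_per_epoch, 4))   # 0.8014
```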

Framework Versions

Python: 3.11.10
Sentence Transformers: 3.5.0.dev0
Transformers: 4.49.0
PyTorch: 2.5.1+cu124
Accelerate: 1.2.0
Datasets: 2.21.0
Tokenizers: 0.21.0

🔧 Technical Details

Loss Function

Trained with the LambdaLoss loss function using these parameters:

{
    "weighting_scheme": "sentence_transformers.cross_encoder.losses.LambdaLoss.NDCGLoss2PPScheme",
    "k": null,
    "sigma": 1.0,
    "eps": 1e-10,
    "reduction_log": "binary",
    "activation_fct": "torch.nn.modules.linear.Identity",
    "mini_batch_size": 8
}
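
To give a feel for how `sigma`, `eps`, and the binary (base-2) logarithm enter a pairwise ranking loss of this kind, here is a simplified sketch for one query. It weights every pair equally, so it is a plain pairwise logistic loss and not the NDCGLoss2PPScheme weighting named in the configuration. The function name and sample numbers are invented for illustration.

```python
import math


def pairwise_logistic_loss(scores, labels, sigma=1.0, eps=1e-10):
    """Sum of -log2(sigmoid(sigma * (s_i - s_j))) over pairs with label_i > label_j.

    Uniform pair weights (an assumption made for brevity); LambdaLoss
    weighting schemes would scale each pair differently.
    """
    loss = 0.0
    for s_i, y_i in zip(scores, labels):
        for s_j, y_j in zip(scores, labels):
            if y_i > y_j:
                prob = 1.0 / (1.0 + math.exp(-sigma * (s_i - s_j)))
                loss -= math.log2(max(prob, eps))  # eps guards against log(0)
    return loss


labels = [1, 0, 0]
good = pairwise_logistic_loss([3.0, 0.5, -1.0], labels)   # relevant doc scored highest
bad = pairwise_logistic_loss([-1.0, 0.5, 3.0], labels)    # relevant doc scored lowest
print(good, bad)
```

The loss shrinks as the relevant document's score pulls ahead of the others, which is the behaviour a ranking loss needs regardless of how the pairs are weighted.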

Training Logs

Epoch	Step	Training Loss	Validation Loss	NanoMSMARCO_R100_ndcg@10	NanoNFCorpus_R100_ndcg@10	NanoNQ_R100_ndcg@10	NanoBEIR_R100_mean_ndcg@10
-1	-1	-	-	0.0234 (-0.5170)	0.3412 (+0.0161)	0.0321 (-0.4686)	0.1322 (-0.3231)
0.0000	1	0.8349	-	-	-	-	-
0.0040	200	0.8417	-	-	-	-	-
...	...	...	...	...	...	...	...
0.8014	40000	0.1381	0.1289	0.7251 (+0.1847)	0.4143 (+0.0892)	0.7594 (+0.2587)	0.6329 (+0.1776)
...	...	...	...	...	...	...	...

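The bold row in the original log marks the saved checkpoint. The bold formatting is lost here; the step 40000 row matches the evaluation results reported above.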
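
The parenthesized values in the tables read as changes relative to a baseline ranking, presumably the first-stage retrieval order that the model reranks. The source does not define them, so treat that reading as an assumption. Under it, subtracting the delta from the score recovers the same baseline from both the initial (-1) row and the step 40000 row:

```python
# (score, delta) pairs copied from the training log, ndcg@10 columns
rows = {
    "NanoMSMARCO_R100": {"initial": (0.0234, -0.5170), "step_40000": (0.7251, 0.1847)},
    "NanoNFCorpus_R100": {"initial": (0.3412, 0.0161), "step_40000": (0.4143, 0.0892)},
    "NanoNQ_R100": {"initial": (0.0321, -0.4686), "step_40000": (0.7594, 0.2587)},
}

baselines = {}
for dataset, entries in rows.items():
    # Implied baseline = reported score minus reported change
    implied = {name: round(score - delta, 4) for name, (score, delta) in entries.items()}
    baselines[dataset] = implied
    print(dataset, implied)
```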

📄 License

This model is released under the Apache-2.0 license.

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

LambdaLoss

@inproceedings{wang2018lambdaloss,
  title={The lambdaloss framework for ranking metric optimization},
  author={Wang, Xuanhui and Li, Cheng and Golbandi, Nadav and Bendersky, Michael and Najork, Marc},
  booktitle={Proceedings of the 27th ACM international conference on information and knowledge management},
  pages={1313--1322},
  year={2018}
}