ft-ms-marco-MiniLM-L12-v2-claims-reranker-v2 open-source model: text reranking and semantic search

Ft Ms Marco MiniLM L12 V2 Claims Reranker V2

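Developed by Davidsamuel101

A cross-encoder model fine-tuned from cross-encoder/ms-marco-MiniLM-L12-v2 for text reranking and semantic search.

Text Embedding

Safetensors

#claim-evidence reranking #high-precision semantic matching #scientific text analysis

Downloads: 769

Released: 5/16/2025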

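Model Overview

The model computes a score for each text pair, which can be used for text reranking and semantic search.

Model Features

Efficient text reranking

Scores and reranks text pairs efficiently, making it suitable for semantic search.

High accuracy

Reaches a mean average precision (MAP) of 0.9904 on the claims-evidence-dev set.

Based on the MiniLM architecture

Built on the compact MiniLM architecture, which balances accuracy against compute requirements.

Model Capabilities

Text pair scoring

Semantic search

Text reranking

Use Cases

Information retrieval

Claim-evidence matching

Matches claims with relevant evidence text

Reaches an MRR@5 of 1.0000 on the claims-evidence-dev set

Search engine reranking

Reranks a search engine's initial results to improve relevance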

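🚀 Cross-Encoder based on cross-encoder/ms-marco-MiniLM-L12-v2

This is a cross-encoder model fine-tuned from cross-encoder/ms-marco-MiniLM-L12-v2 using the sentence-transformers library. It computes scores for pairs of texts, which can be used for text reranking and semantic search.

🚀 Quick Start

The model scores text pairs, and those scores drive text reranking and semantic search. Usage is described below.

✨ Key Features

Fine-tuned from cross-encoder/ms-marco-MiniLM-L12-v2 to score text pairs.
Suitable for text reranking and semantic search tasks.

📦 Installation

First, install the sentence-transformers library:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage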

from sentence_transformers import CrossEncoder

# Download the model from the 🤗 Hub
model = CrossEncoder("Davidsamuel101/ft-ms-marco-MiniLM-L12-v2-claims-reranker-v2")
# Get scores for pairs of texts
pairs = [
    ['Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.', 'At very high concentrations (100 times atmospheric concentration, or greater), carbon dioxide can be toxic to animal life, so raising the concentration to 10,000 ppm (1%) or higher for several hours will eliminate pests such as whiteflies and spider mites in a greenhouse.'],
    ['Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.', 'Plants can grow as much as 50 percent faster in concentrations of 1,000 ppm CO 2 when compared with ambient conditions, though this assumes no change in climate and no limitation on other nutrients.'],
    ['Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.', 'Higher carbon dioxide concentrations will favourably affect plant growth and demand for water.'],
    ['Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.', "Carbon dioxide in the Earth's atmosphere is essential to life and to most of the planetary biosphere."],
    ['Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.', 'Rennie 2009: "Claim 1: Anthropogenic CO2 can\'t be changing climate, because CO2 is only a trace gas in the atmosphere and the amount produced by humans is dwarfed by the amount from volcanoes and other natural sources.'],
]
scores = model.predict(pairs)
print(scores.shape)
# (5,)

# Or rank several texts by their similarity to a single text
ranks = model.rank(
    'Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.',
    [
        'At very high concentrations (100 times atmospheric concentration, or greater), carbon dioxide can be toxic to animal life, so raising the concentration to 10,000 ppm (1%) or higher for several hours will eliminate pests such as whiteflies and spider mites in a greenhouse.',
        'Plants can grow as much as 50 percent faster in concentrations of 1,000 ppm CO 2 when compared with ambient conditions, though this assumes no change in climate and no limitation on other nutrients.',
        'Higher carbon dioxide concentrations will favourably affect plant growth and demand for water.',
        "Carbon dioxide in the Earth's atmosphere is essential to life and to most of the planetary biosphere.",
        'Rennie 2009: "Claim 1: Anthropogenic CO2 can\'t be changing climate, because CO2 is only a trace gas in the atmosphere and the amount produced by humans is dwarfed by the amount from volcanoes and other natural sources.',
    ]
)
# [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
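
To make the relationship between the raw scores from `predict` and the list-of-dicts shape returned by `rank` more concrete, here is a minimal sketch that sorts scores in descending order. It is an illustration only, not the library's implementation: the helper name `rank_from_scores` and the example scores are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Union


def rank_from_scores(
    scores: Sequence[float], top_k: Optional[int] = None
) -> List[Dict[str, Union[int, float]]]:
    """Order candidate indices by descending score (hypothetical helper)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    return [{"corpus_id": i, "score": float(scores[i])} for i in order]


# Made-up scores for illustration; real values would come from model.predict(pairs).
example_scores = [0.12, 0.87, 0.91, 0.45, 0.03]
ranking = rank_from_scores(example_scores, top_k=3)
print([entry["corpus_id"] for entry in ranking])  # [2, 1, 3]
```

In a retrieve-then-rerank setup, the candidates would typically come from a faster first-stage retriever, and only the top few reranked entries would be kept.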

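📚 Detailed Documentation

Model Details

Model Description

Attribute	Details
Model Type	Cross Encoder
Base Model	cross-encoder/ms-marco-MiniLM-L12-v2
Maximum Sequence Length	512 tokens
Number of Output Labels	1 label

Model Sources

Documentation: Sentence Transformers Documentation
Documentation: Cross Encoder Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Cross Encoders on Hugging Face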

Evaluation

Metrics

Cross Encoder Reranking

Dataset: claims-evidence-dev

Evaluated with CrossEncoderRerankingEvaluator using the following parameters:

{
    "at_k": 5,
    "always_rerank_positives": true
}

Metric	Value
map	0.9904 (-0.0096)
mrr@5	1.0000 (+0.0000)
ndcg@5	0.9882 (-0.0118)
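
For readers unfamiliar with the three metrics in the table, the sketch below computes them for one query from binary relevance labels listed in ranked order. These are the textbook definitions and may differ in detail from what CrossEncoderRerankingEvaluator does internally. The function names and the example labels are hypothetical, and reported figures would be averages over all queries.

```python
import math
from typing import Sequence


def mrr_at_k(labels: Sequence[int], k: int) -> float:
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, label in enumerate(labels[:k], start=1):
        if label:
            return 1.0 / rank
    return 0.0


def average_precision(labels: Sequence[int]) -> float:
    """Mean of precision@i over the positions i that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, label in enumerate(labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def ndcg_at_k(labels: Sequence[int], k: int) -> float:
    """DCG of the top k divided by the DCG of an ideal ordering."""

    def dcg(items: Sequence[int]) -> float:
        return sum(l / math.log2(i + 1) for i, l in enumerate(items, start=1))

    ideal = dcg(sorted(labels, reverse=True)[:k])
    return dcg(labels[:k]) / ideal if ideal else 0.0


# Hypothetical ranking: 1 = relevant evidence, 0 = irrelevant, in ranked order.
ranked_labels = [1, 0, 1, 0, 0]
print(mrr_at_k(ranked_labels, 5))
print(average_precision(ranked_labels))
print(ndcg_at_k(ranked_labels, 5))
```

An MRR@5 of 1.0 for a query means a relevant document sits in first place. MAP and NDCG@5 can still fall below 1.0 if other relevant documents are ranked beneath irrelevant ones, as in the example labels.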

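Training Details

Training Dataset

Unnamed Dataset

Size: 23,770 training samples
Columns: text1, text2, and label

Approximate statistics based on the first 1000 samples:

	text1	text2	label
Type	string	string	int
Details	min: 38 characters; mean: 118.57 characters; max: 226 characters	min: 14 characters; mean: 144.96 characters; max: 1176 characters	0: ~83.70%; 1: ~16.30%

Samples:

text1	text2	label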
`Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.`	`At very high concentrations (100 times atmospheric concentration, or greater), carbon dioxide can be toxic to animal life, so raising the concentration to 10,000 ppm (1%) or higher for several hours will eliminate pests such as whiteflies and spider mites in a greenhouse.`	`1`
`Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.`	`Plants can grow as much as 50 percent faster in concentrations of 1,000 ppm CO 2 when compared with ambient conditions, though this assumes no change in climate and no limitation on other nutrients.`	`1`
`Not only is there no scientific evidence that CO2 is a pollutant, higher CO2 concentrations actually help ecosystems support more plant and animal life.`	`Higher carbon dioxide concentrations will favourably affect plant growth and demand for water.`	`1`

Loss: MultipleNegativesRankingLoss with the following parameters:

{
    "scale": 10.0,
    "num_negatives": 4,
    "activation_fn": "torch.nn.modules.activation.Sigmoid"
}
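
As a rough illustration of how these parameters might interact, the sketch below treats one positive and its negatives as a multi-way classification problem: each raw logit passes through a sigmoid, is multiplied by `scale`, and a cross-entropy is taken with the positive as the target. This is a simplified reading under stated assumptions, not the library's code. The function `ranking_loss` is hypothetical, and how the negatives are actually sampled is not covered here.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def ranking_loss(
    pos_logit: float, neg_logits: Sequence[float], scale: float = 10.0
) -> float:
    """Cross-entropy over [positive, *negatives] after sigmoid and scaling.

    Hypothetical helper: assumes the positive occupies index 0.
    """
    scaled = [scale * sigmoid(z) for z in [pos_logit, *neg_logits]]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_sum - scaled[0]


# With 4 negatives and no separation between scores, the loss equals ln(5).
print(ranking_loss(0.0, [0.0] * 4))
# A well-separated positive drives the loss toward 0.
print(ranking_loss(5.0, [-5.0] * 4))
```

Under this reading, a model that cannot tell the positive from 4 negatives sits at ln(5) ≈ 1.609, which gives a reference point for loss values on this scale.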

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 16
learning_rate: 3e-06
num_train_epochs: 5
bf16: True
load_best_model_at_end: True
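
The batch size, epoch count, and dataset size together fix the length of the training run. The worked arithmetic below assumes a single device (an assumption, not stated in the card) and uses values listed among the hyperparameters on this page: gradient_accumulation_steps of 1, a linear scheduler, and no warmup. The helper names `epoch_at` and `linear_lr` are hypothetical.

```python
import math

TRAIN_SAMPLES = 23_770  # size of the training dataset
BATCH_SIZE = 16         # per_device_train_batch_size
EPOCHS = 5              # num_train_epochs
PEAK_LR = 3e-06         # learning_rate

# Assumes one device and gradient_accumulation_steps = 1.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def epoch_at(step: int) -> float:
    """Fractional epoch reached after a given optimizer step."""
    return step / steps_per_epoch


def linear_lr(step: int) -> float:
    """Linear decay from PEAK_LR to 0 over the run, with no warmup."""
    return PEAK_LR * max(0.0, 1.0 - step / total_steps)


print(steps_per_epoch, total_steps)  # 1486 7430
print(round(epoch_at(50), 4))        # 0.0336
print(round(epoch_at(7400), 4))      # 4.9798
```

The epoch values computed for steps 50 and 7400 match the first and last rows of the training log, which is consistent with the single-device assumption.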

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 8
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 3e-06
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 5
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: True
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: True
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
tp_size: 0
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	Training Loss	claims-evidence-dev_ndcg@5
0.0336	50	1.2496	-
0.0673	100	1.2605	0.9523 (-0.0477)
0.1009	150	1.1969	-
0.1346	200	1.2353	0.9529 (-0.0471)
0.1682	250	1.2114	-
0.2019	300	1.1438	0.9551 (-0.0449)
0.2355	350	1.2062	-
0.2692	400	1.1631	0.9568 (-0.0432)
0.3028	450	1.115	-
0.3365	500	1.2029	0.9582 (-0.0418)
0.3701	550	1.0615	-
0.4038	600	1.185	0.9649 (-0.0351)
0.4374	650	1.0651	-
0.4711	700	1.0951	0.9682 (-0.0318)
0.5047	750	1.1267	-
0.5384	800	1.0822	0.9727 (-0.0273)
0.5720	850	1.0658	-
0.6057	900	1.0113	0.9785 (-0.0215)
0.6393	950	1.0578	-
0.6729	1000	1.074	0.9829 (-0.0171)
0.7066	1050	1.0287	-
0.7402	1100	0.9337	0.9873 (-0.0127)
0.7739	1150	0.9798	-
0.8075	1200	0.9697	0.9899 (-0.0101)
0.8412	1250	0.984	-
0.8748	1300	0.9913	0.9898 (-0.0102)
0.9085	1350	1.0126	-
0.9421	1400	0.9458	0.9897 (-0.0103)
0.9758	1450	0.9594	-
1.0094	1500	0.9798	0.9896 (-0.0104)
1.0431	1550	0.9599	-
1.0767	1600	0.9485	0.9887 (-0.0113)
1.1104	1650	0.9021	-
1.1440	1700	0.9778	0.9887 (-0.0113)
1.1777	1750	0.9836	-
1.2113	1800	0.939	0.9912 (-0.0088)
1.2450	1850	0.9476	-
1.2786	1900	0.964	0.9914 (-0.0086)
1.3122	1950	0.9238	-
1.3459	2000	0.9811	0.9895 (-0.0105)
1.3795	2050	0.905	-
1.4132	2100	0.8979	0.9896 (-0.0104)
1.4468	2150	0.8998	-
1.4805	2200	0.9016	0.9896 (-0.0104)
1.5141	2250	0.9183	-
1.5478	2300	0.8805	0.9896 (-0.0104)
1.5814	2350	0.8672	-
1.6151	2400	0.8822	0.9896 (-0.0104)
1.6487	2450	0.8724	-
1.6824	2500	0.9397	0.9883 (-0.0117)
1.7160	2550	0.8903	-
1.7497	2600	0.9305	0.9882 (-0.0118)
1.7833	2650	0.8741	-
1.8170	2700	0.8951	0.9874 (-0.0126)
1.8506	2750	0.8958	-
1.8843	2800	0.8529	0.9873 (-0.0127)
1.9179	2850	0.9468	-
1.9515	2900	0.8683	0.9882 (-0.0118)
1.9852	2950	0.9145	-
2.0188	3000	0.9137	0.9883 (-0.0117)
2.0525	3050	0.8175	-
2.0861	3100	0.911	0.9883 (-0.0117)
2.1198	3150	0.8749	-
2.1534	3200	0.8491	0.9883 (-0.0117)
2.1871	3250	0.9057	-
2.2207	3300	0.9034	0.9882 (-0.0118)
2.2544	3350	0.8505	-
2.2880	3400	0.8762	0.9883 (-0.0117)
2.3217	3450	0.8974	-
2.3553	3500	0.8832	0.9884 (-0.0116)
2.3890	3550	0.851	-
2.4226	3600	0.8584	0.9890 (-0.0110)
2.4563	3650	0.9032	-
2.4899	3700	0.8963	0.9893 (-0.0107)
2.5236	3750	0.8756	-
2.5572	3800	0.843	0.9882 (-0.0118)
2.5908	3850	0.8778	-
2.6245	3900	0.8434	0.9882 (-0.0118)
2.6581	3950	0.9193	-
2.6918	4000	0.8724	0.9875 (-0.0125)
2.7254	4050	0.9062	-
2.7591	4100	0.8807	0.9875 (-0.0125)
2.7927	4150	0.8252	-
2.8264	4200	0.8725	0.9875 (-0.0125)
2.8600	4250	0.9094	-
2.8937	4300	0.8589	0.9874 (-0.0126)
2.9273	4350	0.8625	-
2.9610	4400	0.8138	0.9874 (-0.0126)
2.9946	4450	0.9217	-
3.0283	4500	0.8871	0.9872 (-0.0128)
3.0619	4550	0.8504	-
3.0956	4600	0.944	0.9873 (-0.0127)
3.1292	4650	0.8258	-
3.1629	4700	0.9054	0.9874 (-0.0126)
3.1965	4750	0.8297	-
3.2301	4800	0.8483	0.9875 (-0.0125)
3.2638	4850	0.909	-
3.2974	4900	0.8486	0.9892 (-0.0108)
3.3311	4950	0.8937	-
3.3647	5000	0.8821	0.9874 (-0.0126)
3.3984	5050	0.873	-
3.4320	5100	0.8773	0.9874 (-0.0126)
3.4657	5150	0.8592	-
3.4993	5200	0.8449	0.9882 (-0.0118)
3.5330	5250	0.8651	-
3.5666	5300	0.8943	0.9882 (-0.0118)
3.6003	5350	0.8535	-
3.6339	5400	0.8687	0.9882 (-0.0118)
3.6676	5450	0.9213	-
3.7012	5500	0.887	0.9882 (-0.0118)
3.7349	5550	0.8787	-
3.7685	5600	0.8466	0.9882 (-0.0118)
3.8022	5650	0.8517	-
3.8358	5700	0.8349	0.9883 (-0.0117)
3.8694	5750	0.8647	-
3.9031	5800	0.8406	0.9882 (-0.0118)
3.9367	5850	0.8385	-
3.9704	5900	0.8631	0.9882 (-0.0118)
4.0040	5950	0.823	-
4.0377	6000	0.9163	0.9881 (-0.0119)
4.0713	6050	0.8373	-
4.1050	6100	0.892	0.9882 (-0.0118)
4.1386	6150	0.8666	-
4.1723	6200	0.8536	0.9882 (-0.0118)
4.2059	6250	0.8784	-
4.2396	6300	0.9616	0.9882 (-0.0118)
4.2732	6350	0.8464	-
4.3069	6400	0.865	0.9882 (-0.0118)
4.3405	6450	0.8411	-
4.3742	6500	0.8943	0.9882 (-0.0118)
4.4078	6550	0.8577	-
4.4415	6600	0.8683	0.9882 (-0.0118)
4.4751	6650	0.8706	-
4.5087	6700	0.8645	0.9882 (-0.0118)
4.5424	6750	0.8899	-
4.5760	6800	0.8593	0.9882 (-0.0118)
4.6097	6850	0.8838	-
4.6433	6900	0.8379	0.9882 (-0.0118)
4.6770	6950	0.8759	-
4.7106	7000	0.8608	0.9882 (-0.0118)
4.7443	7050	0.8858	-
4.7779	7100	0.8594	0.9882 (-0.0118)
4.8116	7150	0.8403	-
4.8452	7200	0.8898	0.9882 (-0.0118)
4.8789	7250	0.8382	-
4.9125	7300	0.8307	0.9882 (-0.0118)
4.9462	7350	0.8601	-
4.9798	7400	0.8076	0.9882 (-0.0118)

The bold row denotes the saved checkpoint.

Framework Versions

Python: 3.13.2
Sentence Transformers: 4.1.0
Transformers: 4.51.3
PyTorch: 2.7.0+cu128
Accelerate: 1.6.0
Datasets: 3.6.0
Tokenizers: 0.21.1

📄 Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}