finetuned-ce-climate-multineg-v1: an open-source model for climate text reranking and semantic search

Finetuned Ce Climate Multineg V1

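Developed by CharlesPing

A cross-encoder model fine-tuned from cross-encoder/ms-marco-MiniLM-L12-v2 for reranking and semantic search over climate-related text.

Text Embedding

Safetensors

#Climate text reranking #High-precision semantic matching #Scientific literature retrieval

Downloads: 19

Released: 5/17/2025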

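Model Overview

The model computes a score for each text pair and can be used for text reranking and semantic search. It is tuned specifically for climate science text.

Model Features

Climate-domain tuning

Fine-tuned on climate science text so that it handles the domain's terminology and concepts better.

Efficient reranking

Scores text pairs quickly, making it suitable for reranking large candidate sets.

Multi-negative training

Trained with a mix of negative samples, which improves its ability to separate relevant from irrelevant text.

Model Capabilities

Text similarity scoring

Semantic search

Document reranking

Climate-domain text understanding

Use Cases

Information Retrieval

Climate science literature retrieval

Reranks search results from a climate science literature database so that relevant documents rank higher.

NDCG@1 of 0.6748 on the climate-rerank-multineg evaluation set

Question Answering

Climate question answering

Scores how relevant each candidate answer is to a question in a QA system.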

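🚀 Cross Encoder based on cross-encoder/ms-marco-MiniLM-L12-v2

This is a Cross Encoder model fine-tuned from cross-encoder/ms-marco-MiniLM-L12-v2 on the climate-cross-encoder-mixed-neg-v3 dataset using the sentence-transformers library. It computes scores for pairs of texts, which can be used for text reranking and semantic search.

🚀 Quick Start

The model scores text pairs, which enables text reranking and semantic search. The sections below show how to use it.

✨ Key Features

Fine-tuned from cross-encoder/ms-marco-MiniLM-L12-v2 on the climate-cross-encoder-mixed-neg-v3 dataset.
Computes scores for text pairs, for use in text reranking and semantic search.
Supports inference and further fine-tuning with the sentence-transformers library.

📦 Installation

First, install the sentence-transformers library:

pip install -U sentence-transformers

💻 Usage Examples

Basic usage

The following example loads the model and runs inference: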

from sentence_transformers import CrossEncoder

# Download the model from the Hugging Face Hub
model = CrossEncoder("CharlesPing/finetuned-ce-climate-multineg-v1")
# Get a score for each text pair
pairs = [
    ['Scientific analysis of past climates\xa0shows that greenhouse gasses, principally CO2,\xa0have controlled most ancient\xa0climate changes.', 'Greenhouse gases, in particular carbon dioxide and methane, played a significant role during the Eocene in controlling the surface temperature.'],
    ['Scientific analysis of past climates\xa0shows that greenhouse gasses, principally CO2,\xa0have controlled most ancient\xa0climate changes.', 'Climatic geomorphology is of limited use to study recent (Quaternary, Holocene) large climate changes since there are seldom discernible in the geomorphological record.'],
    ['Scientific analysis of past climates\xa0shows that greenhouse gasses, principally CO2,\xa0have controlled most ancient\xa0climate changes.', 'There is also a close correlation between CO2 and temperature, where CO2 has a strong control over global temperatures in Earth history.'],
    ['Scientific analysis of past climates\xa0shows that greenhouse gasses, principally CO2,\xa0have controlled most ancient\xa0climate changes.', 'While scientists knew of past climate change such as the ice ages, the concept of climate as unchanging was useful in the development of a general theory of what determines climate.'],
    ['Scientific analysis of past climates\xa0shows that greenhouse gasses, principally CO2,\xa0have controlled most ancient\xa0climate changes.', 'Some long term modifications along the history of the planet have been significant, such as the incorporation of oxygen to the atmosphere.'],
]
scores = model.predict(pairs)
print(scores.shape)
# (5,)

# Or rank several texts by their similarity to a single text
ranks = model.rank(
    'Scientific analysis of past climates\xa0shows that greenhouse gasses, principally CO2,\xa0have controlled most ancient\xa0climate changes.',
    [
        'Greenhouse gases, in particular carbon dioxide and methane, played a significant role during the Eocene in controlling the surface temperature.',
        'Climatic geomorphology is of limited use to study recent (Quaternary, Holocene) large climate changes since there are seldom discernible in the geomorphological record.',
        'There is also a close correlation between CO2 and temperature, where CO2 has a strong control over global temperatures in Earth history.',
        'While scientists knew of past climate change such as the ice ages, the concept of climate as unchanging was useful in the development of a general theory of what determines climate.',
        'Some long term modifications along the history of the planet have been significant, such as the incorporation of oxygen to the atmosphere.',
    ]
)
# [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
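
In a retrieval pipeline, the scores from a call like `model.predict(pairs)` are typically used to reorder candidates returned by a cheaper first-stage search. The sketch below shows that reordering step in isolation. The names `rerank` and `overlap_score` are hypothetical, and `overlap_score` is only a toy token-overlap scorer standing in for the model so that the example runs without downloading anything.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[List[Pair]], Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and return the top_k by score."""
    pairs = [(query, doc) for doc in candidates]
    scores = [float(s) for s in score_fn(pairs)]
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def overlap_score(pairs: List[Pair]) -> List[float]:
    """Toy stand-in for a cross-encoder: fraction of query tokens found in the document."""
    scores = []
    for query, doc in pairs:
        query_tokens = set(query.lower().split())
        doc_tokens = set(doc.lower().split())
        scores.append(len(query_tokens & doc_tokens) / len(query_tokens) if query_tokens else 0.0)
    return scores


if __name__ == "__main__":
    docs = ["bears and deer", "co2 controls global temperature", "co2 levels"]
    for doc, score in rerank("co2 controls temperature", docs, overlap_score, top_k=2):
        print(f"{score:.2f}  {doc}")
```

With the real model loaded as in the example above, passing `score_fn=model.predict` should slot in the same way, since `predict` takes a list of text pairs and returns one score per pair.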

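📚 Documentation

Model Details

Model Description

Property	Details
Model type	Cross Encoder
Base model	cross-encoder/ms-marco-MiniLM-L12-v2
Maximum sequence length	512 tokens
Number of output labels	1 label
Training dataset	climate-cross-encoder-mixed-neg-v3

Model Sources

Documentation: Sentence Transformers documentation
Documentation: Cross Encoder documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Cross Encoders on Hugging Face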

Evaluation

Metrics

Cross Encoder Reranking

Dataset: climate-rerank-multineg

Evaluated with CrossEncoderRerankingEvaluator using the following parameters:

{
    "at_k": 1,
    "always_rerank_positives": false
}

Metric	Value
map	0.6809 (-0.3191)
mrr@1	0.6748 (-0.3252)
ndcg@1	0.6748 (-0.3252)
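
To make the three metrics concrete, here is a minimal sketch of how they can be computed for one query from a ranked list of binary relevance labels (1 = relevant, 0 = not). This is an illustrative implementation, not the evaluator's own code, and the function names are hypothetical. It also shows why `mrr@1` and `ndcg@1` coincide in the table: with a cutoff of 1 and binary labels, both reduce to "is the top-ranked document relevant?".

```python
import math
from typing import List


def mrr_at_k(relevance: List[int], k: int) -> float:
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(relevance: List[int], k: int) -> float:
    """DCG of the top k divided by the DCG of an ideal ordering of the same labels."""
    def dcg(rels: List[int]) -> float:
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal > 0 else 0.0


def average_precision(relevance: List[int]) -> float:
    """Mean of precision@rank taken at each relevant item's rank."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


if __name__ == "__main__":
    # Labels listed in the order the reranker returned the documents.
    queries = [[1, 0, 0], [0, 1, 0, 1]]
    print("mrr@1 :", sum(mrr_at_k(q, 1) for q in queries) / len(queries))
    print("ndcg@1:", sum(ndcg_at_k(q, 1) for q in queries) / len(queries))
    print("map   :", sum(average_precision(q) for q in queries) / len(queries))
```

Each parenthesized value in the table equals the metric minus 1.0. That would be consistent with a comparison against a baseline ordering that scores perfectly, though the card does not state this explicitly.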

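Training Details

Training Dataset

climate-cross-encoder-mixed-neg-v3

Dataset: climate-cross-encoder-mixed-neg-v3 at revision cd49b57
Size: 41,052 training samples
Columns: query, doc, and label

Approximate statistics based on the first 1000 samples:

	Query	Document	Label
Type	string	string	float
Details	min: 49 characters; mean: 140.03 characters; max: 306 characters	min: 4 characters; mean: 136.03 characters; max: 731 characters	min: 0.0; mean: 0.09; max: 1.0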
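
Statistics of this kind can be reproduced with a few lines of Python. The sketch below assumes the samples are available as plain lists of strings and floats. The helper names are hypothetical, and the tiny inputs are placeholders rather than rows from the dataset.

```python
from statistics import mean
from typing import Dict, List


def length_stats(texts: List[str]) -> Dict[str, float]:
    """Minimum, mean, and maximum character length of a list of texts."""
    lengths = [len(text) for text in texts]
    return {"min": min(lengths), "mean": mean(lengths), "max": max(lengths)}


def label_stats(labels: List[float]) -> Dict[str, float]:
    """Minimum, mean, and maximum of the label column."""
    return {"min": min(labels), "mean": mean(labels), "max": max(labels)}


if __name__ == "__main__":
    print(length_stats(["ab", "abcd"]))
    print(label_stats([1.0, 0.0, 0.0, 0.0]))
```

If the labels are all 0.0 or 1.0, as in the samples shown, a mean label of 0.09 means that roughly 9% of the first 1000 pairs are positives.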

Samples:

Query	Document	Label
`“A leading Canadian authority on polar bears, Mitch Taylor, said: ‘We’re seeing an increase in bears that’s really unprecedented, and in places where we’re seeing a decrease in the population`	`Warnings about the future of the polar bear are often contrasted with the fact that worldwide population estimates have increased over the past 50 years and are relatively stable today.`	`1.0`
`“A leading Canadian authority on polar bears, Mitch Taylor, said: ‘We’re seeing an increase in bears that’s really unprecedented, and in places where we’re seeing a decrease in the population`	`Species distribution models of recent years indicate that the deer tick, known as "I. scapularis," is pushing its distribution to higher latitudes of the Northeastern United States and Canada, as well as pushing and maintaining populations in the South Central and Northern Midwest regions of the United States.`	`0.0`
`“A leading Canadian authority on polar bears, Mitch Taylor, said: ‘We’re seeing an increase in bears that’s really unprecedented, and in places where we’re seeing a decrease in the population`	`Bear and deer are among the animals present.`	`0.0`

Loss: BinaryCrossEntropyLoss with the following parameters:

{
    "activation_fn": "torch.nn.modules.linear.Identity",
    "pos_weight": null
}
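
With an Identity activation, the model's raw output (a logit) goes straight into the loss. The sketch below computes binary cross-entropy on a single logit in a numerically stable way, assuming the loss applies the sigmoid internally in the manner of PyTorch's `BCEWithLogitsLoss`. The function name is hypothetical, and `pos_weight=1.0` is used here to stand for the `null` setting above.

```python
import math


def bce_with_logits(logit: float, label: float, pos_weight: float = 1.0) -> float:
    """Binary cross-entropy on a raw logit.

    Uses softplus(z) = max(z, 0) + log(1 + exp(-|z|)), so that
    -log(sigmoid(x)) = softplus(-x) and -log(1 - sigmoid(x)) = softplus(x).
    """
    tail = math.log1p(math.exp(-abs(logit)))
    loss_if_positive = max(-logit, 0.0) + tail  # -log(sigmoid(logit))
    loss_if_negative = max(logit, 0.0) + tail   # -log(1 - sigmoid(logit))
    return pos_weight * label * loss_if_positive + (1.0 - label) * loss_if_negative


if __name__ == "__main__":
    for logit in (-4.0, 0.0, 4.0):
        print(logit, bce_with_logits(logit, 1.0), bce_with_logits(logit, 0.0))
```

A `pos_weight` above 1 would scale up the loss on positive pairs, which is one common way to compensate for a dataset in which negatives heavily outnumber positives. Here it is left unset.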

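Evaluation Dataset

climate-cross-encoder-mixed-neg-v3

Dataset: climate-cross-encoder-mixed-neg-v3 at revision cd49b57
Size: 4,290 evaluation samples
Columns: query, doc, and label

Approximate statistics based on the first 1000 samples:

	Query	Document	Label
Type	string	string	float
Details	min: 39 characters; mean: 116.67 characters; max: 240 characters	min: 18 characters; mean: 132.92 characters; max: 731 characters	min: 0.0; mean: 0.09; max: 1.0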

Samples:

Query	Document	Label
`Scientific analysis of past climates shows that greenhouse gasses, principally CO2, have controlled most ancient climate changes.`	`Greenhouse gases, in particular carbon dioxide and methane, played a significant role during the Eocene in controlling the surface temperature.`	`1.0`
`Scientific analysis of past climates shows that greenhouse gasses, principally CO2, have controlled most ancient climate changes.`	`Climatic geomorphology is of limited use to study recent (Quaternary, Holocene) large climate changes since there are seldom discernible in the geomorphological record.`	`0.0`
`Scientific analysis of past climates shows that greenhouse gasses, principally CO2, have controlled most ancient climate changes.`	`There is also a close correlation between CO2 and temperature, where CO2 has a strong control over global temperatures in Earth history.`	`0.0`

Loss: BinaryCrossEntropyLoss with the following parameters:

{
    "activation_fn": "torch.nn.modules.linear.Identity",
    "pos_weight": null
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 32
learning_rate: 2e-05
warmup_ratio: 0.1
fp16: True
load_best_model_at_end: True
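
The learning rate and warmup settings combine with the linear scheduler and 3 training epochs listed in the full hyperparameter set. The sketch below estimates the learning rate at a given optimizer step under a standard linear warmup and linear decay. The total of 7,698 steps is derived rather than stated in the card: ceil(41,052 / 16) = 2,566 steps per epoch over 3 epochs, assuming a single device and no gradient accumulation. The function name is hypothetical.

```python
import math


def linear_warmup_lr(
    step: int,
    base_lr: float = 2e-5,
    total_steps: int = 7698,   # assumption: ceil(41052 / 16) * 3 epochs
    warmup_ratio: float = 0.1,
) -> float:
    """Learning rate under linear warmup followed by linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 385, 770, 4000, 7698):
        print(step, linear_warmup_lr(step))
```

Under these assumptions the warmup lasts 770 steps, so the peak rate of 2e-05 is reached roughly 30% of the way through the first epoch.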

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 32
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 3
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: True
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
tp_size: 0
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	Training Loss	Validation Loss	climate-rerank-multineg_ndcg@1
0.0390	100	0.5097	-	-
0.0779	200	0.3662	-	-
0.1169	300	0.3034	-	-
0.1559	400	0.2655	-	-
0.1949	500	0.2651	0.2262	0.6585 (-0.3415)
0.2338	600	0.2161	-	-
0.2728	700	0.227	-	-
0.3118	800	0.235	-	-
0.3507	900	0.2243	-	-
0.3897	1000	0.2081	0.2174	0.6992 (-0.3008)
0.4287	1100	0.1961	-	-
0.4677	1200	0.207	-	-
0.5066	1300	0.2375	-	-
0.5456	1400	0.2117	-	-
0.5846	1500	0.2058	0.2253	0.6748 (-0.3252)
0.6235	1600	0.2163	-	-
0.6625	1700	0.2235	-	-
0.7015	1800	0.2193	-	-
0.7405	1900	0.1924	-	-
0.7794	2000	0.2084	0.2095	0.6748 (-0.3252)
0.8184	2100	0.2113	-	-
0.8574	2200	0.2276	-	-
0.8963	2300	0.2071	-	-
0.9353	2400	0.2374	-	-
0.9743	2500	0.2173	0.2172	0.6667 (-0.3333)
1.0133	2600	0.2011	-	-
1.0522	2700	0.1634	-	-
1.0912	2800	0.1807	-	-
1.1302	2900	0.1878	-	-
1.1691	3000	0.2037	0.2147	0.6911 (-0.3089)
1.2081	3100	0.1904	-	-
1.2471	3200	0.1911	-	-
1.2860	3300	0.1828	-	-
1.3250	3400	0.1686	-	-
1.3640	3500	0.1892	0.2179	0.6992 (-0.3008)
1.4030	3600	0.188	-	-
1.4419	3700	0.1691	-	-
1.4809	3800	0.1946	-	-
1.5199	3900	0.1938	-	-
1.5588	4000	0.211	0.2088	0.6992 (-0.3008)
1.5978	4100	0.1826	-	-
1.6368	4200	0.1608	-	-
1.6758	4300	0.1782	-	-
1.7147	4400	0.1803	-	-
1.7537	4500	0.1804	0.2160	0.6911 (-0.3089)
1.7927	4600	0.1823	-	-
1.8316	4700	0.1844	-	-
1.8706	4800	0.1727	-	-
1.9096	4900	0.1937	-	-
1.9486	5000	0.1662	0.2219	0.6829 (-0.3171)
1.9875	5100	0.1653	-	-
2.0265	5200	0.1658	-	-
2.0655	5300	0.1316	-	-
2.1044	5400	0.1379	-	-
2.1434	5500	0.152	0.2513	0.6504 (-0.3496)
2.1824	5600	0.1848	-	-
2.2214	5700	0.1507	-	-
2.2603	5800	0.1495	-	-
2.2993	5900	0.1469	-	-
2.3383	6000	0.1596	0.2407	0.6585 (-0.3415)
2.3772	6100	0.1518	-	-
2.4162	6200	0.1351	-	-
2.4552	6300	0.1706	-	-
2.4942	6400	0.1538	-	-
2.5331	6500	0.1329	0.2505	0.6911 (-0.3089)
2.5721	6600	0.147	-	-
2.6111	6700	0.1289	-	-
2.6500	6800	0.1698	-	-
2.6890	6900	0.1456	-	-
2.7280	7000	0.141	0.2618	0.6748 (-0.3252)
2.7670	7100	0.1413	-	-
2.8059	7200	0.1474	-	-
2.8449	7300	0.1381	-	-
2.8839	7400	0.1252	-	-
2.9228	7500	0.1384	0.2608	0.6748 (-0.3252)
2.9618	7600	0.1826	-	-

Bold rows denote the saved checkpoint.
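
The evaluation rows of the log can be inspected programmatically to see which step a best-checkpoint rule would pick. The sketch below copies the 15 rows that report a validation loss and finds the step with the lowest loss and the steps with the highest NDCG@1. Which metric the run actually used to select its checkpoint is not stated here. Validation loss is assumed only because it is, to my knowledge, the Transformers default when `load_best_model_at_end` is enabled without an explicit metric.

```python
# (step, validation loss, ndcg@1) for every evaluation row in the training log.
eval_log = [
    (500, 0.2262, 0.6585), (1000, 0.2174, 0.6992), (1500, 0.2253, 0.6748),
    (2000, 0.2095, 0.6748), (2500, 0.2172, 0.6667), (3000, 0.2147, 0.6911),
    (3500, 0.2179, 0.6992), (4000, 0.2088, 0.6992), (4500, 0.2160, 0.6911),
    (5000, 0.2219, 0.6829), (5500, 0.2513, 0.6504), (6000, 0.2407, 0.6585),
    (6500, 0.2505, 0.6911), (7000, 0.2618, 0.6748), (7500, 0.2608, 0.6748),
]

# Step with the lowest validation loss (assumed selection criterion).
best_by_loss = min(eval_log, key=lambda row: row[1])

# All steps that tie for the highest NDCG@1.
best_ndcg = max(row[2] for row in eval_log)
steps_with_best_ndcg = [row[0] for row in eval_log if row[2] == best_ndcg]

if __name__ == "__main__":
    print("lowest validation loss:", best_by_loss)
    print("highest ndcg@1:", best_ndcg, "at steps", steps_with_best_ndcg)
```

The log also shows validation loss rising above 0.24 from step 5500 onward while training loss keeps falling, a pattern consistent with overfitting during the third epoch.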

Framework Versions

Python: 3.11.12
Sentence Transformers: 4.1.0
Transformers: 4.51.3
PyTorch: 2.6.0+cu124
Accelerate: 1.6.0
Datasets: 3.6.0
Tokenizers: 0.21.1

📄 Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}