ModernBERT-large-msmarco-bpr open-source model: semantic search and text similarity comparison

Modernbert Large Msmarco Bpr

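Developed by BlackBeenie

A sentence-transformers model fine-tuned from ModernBERT-large. It maps sentences and paragraphs to a 1024-dimensional dense vector space and supports tasks such as semantic textual similarity and semantic search.

Text embedding

Safetensors

#long-text semantic matching #high-dimensional vector retrieval #BPR loss optimization

Downloads: 21

Released: 2/7/2025

Model overview

This model is fine-tuned from the ModernBERT-large architecture to produce vector representations of sentences and paragraphs, and is suited to a range of natural language processing tasks.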

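Model features

Long-text handling

Supports sequences of up to 8192 tokens, which suits long documents and paragraphs.

Efficient vector representation

Maps text to a 1024-dimensional dense vector space that preserves rich semantic information.

Fine-tuning

Fine-tuned from the ModernBERT-large architecture specifically to improve performance on sentence similarity tasks.

Model capabilities

Semantic textual similarity

Semantic search

Paraphrase mining

Text classification

Text clustering

Use cases

Information retrieval

Relevant document retrieval

Finds document passages that are semantically related to a query sentence

Matches text that is semantically related but worded differently

Question answering

Answer passage matching

Scores candidate answer passages by their similarity to the user's question

Finds the answer passages most relevant to the question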

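🚀 Sentence transformer based on answerdotai/ModernBERT-large

This is a sentence-transformers model fine-tuned from answerdotai/ModernBERT-large. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.

🚀 Quick start

Install the Sentence Transformers library, load the model from the Hub, and call encode on your sentences, as described in the installation and usage sections below.

✨ Key features

Fine-tuned from answerdotai/ModernBERT-large.
Maps sentences and paragraphs to a 1024-dimensional dense vector space.
Usable for a range of natural language processing tasks, such as semantic textual similarity and semantic search.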

📦 Installation

First, install the Sentence Transformers library:

pip install -U sentence-transformers

💻 Usage example

Basic usage

from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("BlackBeenie/ModernBERT-large-msmarco-bpr")
# Run inference
sentences = [
    'what is the average top third score on the act',
    'North Dakota is among a dozen states where high school students are required to take the ACT before graduating. The state tied with Colorado for third with an average composite score of 20.6 this year. Utah was first with an average of 20.8 and Illinois was second at 20.7. ACT composite scores range from 1 to 36. The national average is 21.0. A total of 7,227 students in North Dakota took the ACT this year.',
    "The average ACT score composite at Duke is a 34. The 25th percentile ACT score is 32, and the 75th percentile ACT score is 35. In other words, a 32 places you below average, while a 35 will move you up to above average.f you're a junior or senior, your GPA is hard to change from this point on. If your GPA is at or below the school average of 4.19, you'll need a higher ACT score to compensate and show that you're prepared to take on college academics.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Compute pairwise similarity scores between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
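
The usage example above returns a full similarity matrix. For the retrieval use case, the same idea can be sketched as ranking passages against a single query. This is a minimal, library-free illustration: the function names `cosine` and `rank_passages` and the 3-dimensional toy vectors are hypothetical stand-ins for the 1024-dimensional vectors that `model.encode` would produce.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_passages(query_vec, passage_vecs):
    """Return (passage_index, score) pairs, best match first."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional stand-ins for real embeddings (assumed values).
query = [1.0, 0.0, 1.0]
passages = [
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [2.0, 0.0, 2.0],   # same direction as the query
    [1.0, 1.0, 0.0],   # partially aligned with the query
]
ranking = rank_passages(query, passages)
print([index for index, _ in ranking])
# [1, 2, 0]
```

With real embeddings, the first row of `sentences` would play the role of `query` and the remaining rows the role of `passages`.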

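📚 Documentation

Model details

Model description

Attribute	Details
Model type	Sentence Transformer
Base model	answerdotai/ModernBERT-large
Maximum sequence length	8192 tokens
Output dimensionality	1024 dimensions
Similarity function	Cosine similarity

Model sources

Documentation: Sentence Transformers documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face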

Full model architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
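
The Pooling module above has `pooling_mode_cls_token` set to True, so the sentence embedding is taken from the first token's vector rather than from an average over all tokens. The sketch below contrasts the two on toy data. The helper names and the 2-dimensional vectors are hypothetical; the real model works on 1024-dimensional token vectors.

```python
def cls_pooling(token_embeddings):
    """Use the first token's vector as the sentence embedding."""
    return list(token_embeddings[0])

def mean_pooling(token_embeddings, attention_mask):
    """Average the vectors of non-padding tokens (shown only for contrast)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy sequence: [CLS], two word tokens, one padding token (assumed values).
tokens = [[0.5, -1.0], [1.0, 3.0], [3.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(cls_pooling(tokens))
# [0.5, -1.0]
print(mean_pooling(tokens, mask))
# [1.5, 1.0]
```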

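Training details

Training dataset

Unnamed dataset

Size: 498,970 training samples
Columns: sentence_0, sentence_1, and sentence_2

Approximate statistics based on the first 1000 samples:

	sentence_0	sentence_1	sentence_2
Type	string	string	string
Details	min: 4 tokens; mean: 9.24 tokens; max: 27 tokens	min: 23 tokens; mean: 83.71 tokens; max: 279 tokens	min: 17 tokens; mean: 79.72 tokens; max: 262 tokens

Samples: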

sentence_0	sentence_1	sentence_2
`what is tongkat ali`	`Tongkat Ali is a very powerful herb that acts as a sex enhancer by naturally increasing the testosterone levels, and revitalizing sexual impotence, performance and pleasure. Tongkat Ali is also effective in building muscular volume & strength resulting to a healthy physique.`	`However, unlike tongkat ali extract, tongkat ali chipped root and root powder are not sterile. Thus, the raw consumption of root powder is not recommended. The traditional preparation in Indonesia and Malaysia is to boil chipped roots as a tea. A standard dosage would be 50 gram of chipped root per person per day.`
`cost to install engineered hardwood flooring`	`Burton says his customers typically spend about $8 per square foot for engineered hardwood flooring; add an additional $2 per square foot for installation. Minion says consumers should expect to pay $7 to $12 per square foot for quality hardwood flooring. "If the homeowner buys the wood and you need somebody to install it, usually an installation goes for about $2 a square foot," Bill LeBeau, owner of LeBeau's Hardwood Floors of Huntersville, North Carolina, says.`	`Installing hardwood flooring can cost between $9 and $12 per square foot, compared with about $3 to $5 per square foot for carpet - so some homeowners opt to install hardwood only in some rooms rather than throughout their home.However, carpet typically needs to be replaced if it becomes stained or worn out.ardwood flooring lasts longer than carpet, can be easier to keep clean and can be refinished. In the end, though, the decision about whether to install hardwood or carpeting in a bedroom should be based on your personal preference, at least if you intend to stay in the home for years.`
`define pollute`	`pollutes; polluted; polluting. Learner's definition of POLLUTE. [+ object] : to make (land, water, air, etc.) dirty and not safe or suitable to use. Waste from the factory had polluted [=contaminated] the river. Miles of beaches were polluted by the oil spill. Car exhaust pollutes the air.`	`Definition of pollute written for English Language Learners from the Merriam-Webster Learner's Dictionary with audio pronunciations, usage examples, and count/noncount noun labels. Learner's Dictionary mobile search`

Loss: beir.losses.bpr_loss.BPRLoss
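
Each training row is a (query, positive passage, negative passage) triplet. The sketch below shows two ingredients that a loss over such triplets can combine: a softmax cross-entropy term over one query's candidate scores, and a margin ranking term. It assumes that BPRLoss follows the Binary Passage Retriever recipe of adding these two terms. That reading, the margin of 2.0, and all scores are assumptions, and the binarization step of the real loss is left out. This is not the beir implementation.

```python
import math

def cross_entropy(scores, target_index):
    """Negative log-softmax of the target score among all candidate scores."""
    top = max(scores)
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum - scores[target_index]

def margin_ranking(pos_score, neg_score, margin):
    """Zero once the positive beats the negative by at least `margin`."""
    return max(0.0, margin - (pos_score - neg_score))

# Hypothetical similarity scores for one query against its candidates.
scores = [4.0, 1.0, 0.5]   # index 0 = positive passage, the rest = negatives
negatives = scores[1:]

dense_term = cross_entropy(scores, target_index=0)
rank_term = sum(margin_ranking(scores[0], s, margin=2.0) for s in negatives) / len(negatives)
total = dense_term + rank_term
print(dense_term, rank_term, total)
```

Both terms shrink as the positive passage's score pulls away from the negatives, which is the behavior a retrieval model is trained toward.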

Training hyperparameters

Non-default hyperparameters

eval_strategy: steps
per_device_train_batch_size: 32
per_device_eval_batch_size: 32
num_train_epochs: 5
fp16: True
multi_dataset_batch_sampler: round_robin
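
The batch size and epoch count above, together with the reported 498,970 training samples, account for the step counts that appear in the training log. The arithmetic below is a sketch that assumes a single device, no gradient accumulation, and a final partial batch that is kept rather than dropped. The function name `epoch_at` is hypothetical.

```python
import math

train_samples = 498_970      # dataset size reported on this page
batch_size = 32              # per_device_train_batch_size
epochs = 5                   # num_train_epochs

# Assumes one device, gradient_accumulation_steps = 1, and no dropped last batch.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

def epoch_at(step):
    """Fractional epoch reached after `step` optimizer steps."""
    return round(step / steps_per_epoch, 4)

print(steps_per_epoch, total_steps, epoch_at(500))
# 15593 77965 0.0321
```

These values match the log entries at epoch 1.0 (step 15593), epoch 5.0 (step 77965), and the first logged row (epoch 0.0321 at step 500).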

All hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 32
per_device_eval_batch_size: 32
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 5
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin
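
The list above sets `lr_scheduler_type: linear`, `learning_rate: 5e-05`, and no warmup. A minimal sketch of what such a schedule looks like follows. It assumes the usual linear-warmup-then-linear-decay rule and takes the total of 77,965 steps from the training log. The function name `linear_lr` is hypothetical and is not a library call.

```python
def linear_lr(step, base_lr=5e-05, total_steps=77_965, warmup_steps=0):
    """Learning rate after `step` updates: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Start, roughly the midpoint, and the end of training.
for step in (0, 38_982, 77_965):
    print(step, linear_lr(step))
```

Under these assumptions the rate starts at 5e-05 and falls to zero at the final step, so later epochs take progressively smaller updates.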

Training logs

Epoch	Step	Training Loss
0.0321	500	1.517
0.0641	1000	0.355
0.0962	1500	0.3123
0.1283	2000	0.2916
0.1603	2500	0.2805
0.1924	3000	0.2782
0.2245	3500	0.2806
0.2565	4000	0.2831
0.2886	4500	0.2837
0.3207	5000	0.2603
0.3527	5500	0.2529
0.3848	6000	0.2681
0.4169	6500	0.2573
0.4489	7000	0.2678
0.4810	7500	0.2786
0.5131	8000	0.2559
0.5451	8500	0.2771
0.5772	9000	0.2807
0.6092	9500	0.2627
0.6413	10000	0.2536
0.6734	10500	0.2607
0.7054	11000	0.2578
0.7375	11500	0.2615
0.7696	12000	0.2624
0.8016	12500	0.2491
0.8337	13000	0.2487
0.8658	13500	0.2524
0.8978	14000	0.2465
0.9299	14500	0.2575
0.9620	15000	0.2412
0.9940	15500	0.2514
1.0	15593	-
1.0261	16000	0.1599
1.0582	16500	0.1495
1.0902	17000	0.1494
1.1223	17500	0.1437
1.1544	18000	0.1541
1.1864	18500	0.1455
1.2185	19000	0.1424
1.2506	19500	0.1456
1.2826	20000	0.1552
1.3147	20500	0.1508
1.3468	21000	0.1474
1.3788	21500	0.1534
1.4109	22000	0.1505
1.4430	22500	0.149
1.4750	23000	0.1616
1.5071	23500	0.1528
1.5392	24000	0.1531
1.5712	24500	0.151
1.6033	25000	0.1666
1.6353	25500	0.153
1.6674	26000	0.1532
1.6995	26500	0.1614
1.7315	27000	0.1576
1.7636	27500	0.154
1.7957	28000	0.1597
1.8277	28500	0.1512
1.8598	29000	0.1652
1.8919	29500	0.151
1.9239	30000	0.1561
1.9560	30500	0.1508
1.9881	31000	0.1463
2.0	31186	-
2.0201	31500	0.0999
2.0522	32000	0.0829
2.0843	32500	0.0799
2.1163	33000	0.0843
2.1484	33500	0.091
2.1805	34000	0.0843
2.2125	34500	0.092
2.2446	35000	0.0879
2.2767	35500	0.0914
2.3087	36000	0.092
2.3408	36500	0.101
2.3729	37000	0.1038
2.4049	37500	0.1084
2.4370	38000	0.0923
2.4691	38500	0.1083
2.5011	39000	0.0909
2.5332	39500	0.0918
2.5653	40000	0.101
2.5973	40500	0.0935
2.6294	41000	0.0858
2.6615	41500	0.0821
2.6935	42000	0.0755
2.7256	42500	0.0902
2.7576	43000	0.0906
2.7897	43500	0.089
2.8218	44000	0.088
2.8538	44500	0.0866
2.8859	45000	0.0914
2.9180	45500	0.0903
2.9500	46000	0.0903
2.9821	46500	0.0932
3.0	46779	-
3.0142	47000	0.0724
3.0462	47500	0.0465
3.0783	48000	0.049
3.1104	48500	0.0458
3.1424	49000	0.0461
3.1745	49500	0.0456
3.2066	50000	0.0469
3.2386	50500	0.051
3.2707	51000	0.044
3.3028	51500	0.0551
3.3348	52000	0.0549
3.3669	52500	0.0539
3.3990	53000	0.0515
3.4310	53500	0.0544
3.4631	54000	0.044
3.4952	54500	0.0499
3.5272	55000	0.0557
3.5593	55500	0.0571
3.5914	56000	0.0673
3.6234	56500	0.0512
3.6555	57000	0.0474
3.6876	57500	0.049
3.7196	58000	0.0552
3.7517	58500	0.046
3.7837	59000	0.0488
3.8158	59500	0.0477
3.8479	60000	0.054
3.8799	60500	0.0595
3.9120	61000	0.0462
3.9441	61500	0.0472
3.9761	62000	0.0553
4.0	62372	-
4.0082	62500	0.0438
4.0403	63000	0.0178
4.0723	63500	0.0187
4.1044	64000	0.0219
4.1365	64500	0.0254
4.1685	65000	0.0222
4.2006	65500	0.0229
4.2327	66000	0.0206
4.2647	66500	0.0195
4.2968	67000	0.0184
4.3289	67500	0.0224
4.3609	68000	0.019
4.3930	68500	0.0204
4.4251	69000	0.0187
4.4571	69500	0.0207
4.4892	70000	0.0215
4.5213	70500	0.0194
4.5533	71000	0.0206
4.5854	71500	0.0189
4.6175	72000	0.0222
4.6495	72500	0.0198
4.6816	73000	0.0199
4.7137	73500	0.0155
4.7457	74000	0.0185
4.7778	74500	0.0176
4.8099	75000	0.0181
4.8419	75500	0.0165
4.8740	76000	0.0204
4.9060	76500	0.0163
4.9381	77000	0.0154
4.9702	77500	0.0194
5.0	77965	-

Framework versions

Python: 3.11.11
Sentence Transformers: 3.4.1
Transformers: 4.48.2
PyTorch: 2.5.1+cu124
Accelerate: 1.3.0
Datasets: 3.2.0
Tokenizers: 0.21.0
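
To compare a local environment against the versions listed above, a small check along the following lines may help. It is a sketch using only the standard library. The distribution names in `REPORTED` (for example `torch` for PyTorch) are the usual PyPI names and are an assumption, and `compare_versions` is a hypothetical helper.

```python
from importlib import metadata

# Versions listed on this page, keyed by assumed PyPI distribution name.
REPORTED = {
    "sentence-transformers": "3.4.1",
    "transformers": "4.48.2",
    "torch": "2.5.1+cu124",
    "accelerate": "1.3.0",
    "datasets": "3.2.0",
    "tokenizers": "0.21.0",
}

def compare_versions(reported):
    """Map each package to (reported, installed); installed is None if absent."""
    result = {}
    for name, wanted in reported.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        result[name] = (wanted, installed)
    return result

for name, (wanted, installed) in compare_versions(REPORTED).items():
    print(f"{name}: reported {wanted}, installed {installed}")
```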

📄 License

The documentation does not state a license.

📖 Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}