reranker-bert-tiny-gooaq-bce-tanh-v3 open-source model: scores text-pair similarity for semantic search and classification

Reranker Bert Tiny Gooaq Bce Tanh V3

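Developed by cross-encoder-testing

A cross-encoder model fine-tuned from BERT-tiny that computes similarity scores for text pairs. It is suited to tasks such as semantic search and text classification.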

Text Embedding

Safetensors

Language: English · License: Apache-2.0 · Tags: #QA-reranking #lightweight-BERT #semantic-matching

Downloads: 1,962

Released: 3/4/2025

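Model overview

The model was developed with the sentence-transformers library. It computes similarity scores for text pairs and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.

Model features

Efficient and lightweight

Built on the BERT-tiny architecture, so the model is small and inference is fast.

Semantic relevance scoring

Scores how semantically relevant the two texts in a pair are to each other.

Large-scale training

Trained on 578,402 GooAQ samples.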

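Model capabilities

Text similarity scoring

Semantic search reranking

Question-answer pair matching

Text classification

Use cases

Information retrieval

Search engine result reranking

Reorders the results returned by a search engine by relevance.

Reaches a map of 0.5677 on the gooaq-dev dataset.

Question answering systems

Question-answer pair matching

Scores the relevance of candidate answers to a question.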

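🚀 BERT-tiny trained on GooAQ

This is a Cross Encoder model fine-tuned from prajjwal1/bert-tiny using the sentence-transformers library. It computes scores for pairs of texts, which can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

This model was trained using train_script.py.

🚀 Quick start

This model is a fine-tuned Cross Encoder that scores text pairs and is applicable to a range of natural language processing tasks. The sections below cover installing the required library and running inference with the model.

✨ Key features

Multi-task applicability: usable for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other tasks.
Fine-tuned model: fine-tuned from prajjwal1/bert-tiny on GooAQ question-answer pairs.

📦 Installation

First, install the Sentence Transformers library: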

pip install -U sentence-transformers

💻 Usage examples

Basic usage

Once the library is installed, you can load the model and run inference:

from sentence_transformers import CrossEncoder

# Download the model from the 🤗 Hub
model = CrossEncoder("cross-encoder-testing/reranker-bert-tiny-gooaq-bce")
# Define the text pairs
pairs = [
    ['are javascript developers in demand?', "JavaScript is the skill that is most in-demand for IT in 2020, according to a report from developer skills tester DevSkiller. The report, “Top IT Skills report 2020: Demand and Hiring Trends,” has JavaScript switching places with Java when compared to last year's report, with Java in third place this year, behind SQL."],
    ['are javascript developers in demand?', 'In one line difference between the two is: JavaScript is the programming language where as AngularJS is a framework based on JavaScript. ... It is also the basic for all java script based technologies like jquery, angular JS, bootstrap JS and so on. Angular JS is a framework written in javascript and uses MVC architecture.'],
    ['are javascript developers in demand?', 'Java applications are run in a virtual machine or web browser while JavaScript is run on a web browser. Java code is compiled whereas while JavaScript code is in text and in a web page. JavaScript is an OOP scripting language, whereas Java is an OOP programming language.'],
    ['are javascript developers in demand?', 'Things in the body tag are the things that should be displayed: the actual content. Javascript in the body is executed as it is read and as the page is rendered. Javascript in the head is interpreted before anything is rendered.'],
    ['are javascript developers in demand?', 'Web apps tend to be built using JavaScript, CSS and HTML5. Unlike mobile apps, there is no standard software development kit for building web apps. However, developers do have access to templates. Compared to mobile apps, web apps are usually quicker and easier to build — but they are much simpler in terms of features.'],
]
# Predict the scores
scores = model.predict(pairs)
print(scores.shape)
# (5,)

# Or rank different texts by their similarity to a single text
ranks = model.rank(
    'are javascript developers in demand?',
    [
        "JavaScript is the skill that is most in-demand for IT in 2020, according to a report from developer skills tester DevSkiller. The report, “Top IT Skills report 2020: Demand and Hiring Trends,” has JavaScript switching places with Java when compared to last year's report, with Java in third place this year, behind SQL.",
        'In one line difference between the two is: JavaScript is the programming language where as AngularJS is a framework based on JavaScript. ... It is also the basic for all java script based technologies like jquery, angular JS, bootstrap JS and so on. Angular JS is a framework written in javascript and uses MVC architecture.',
        'Java applications are run in a virtual machine or web browser while JavaScript is run on a web browser. Java code is compiled whereas while JavaScript code is in text and in a web page. JavaScript is an OOP scripting language, whereas Java is an OOP programming language.',
        'Things in the body tag are the things that should be displayed: the actual content. Javascript in the body is executed as it is read and as the page is rendered. Javascript in the head is interpreted before anything is rendered.',
        'Web apps tend to be built using JavaScript, CSS and HTML5. Unlike mobile apps, there is no standard software development kit for building web apps. However, developers do have access to templates. Compared to mobile apps, web apps are usually quicker and easier to build — but they are much simpler in terms of features.',
    ]
)
# [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
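
To make the ranking step concrete, the following is a minimal pure-Python sketch of what sorting candidates by score amounts to. It produces dictionaries with the same `corpus_id` and `score` keys shown in the comment above. The helper name `rank_by_score` and the example scores are made up for illustration; this is not the library's implementation.

```python
from typing import Dict, List, Optional, Sequence, Union


def rank_by_score(
    scores: Sequence[float], top_k: Optional[int] = None
) -> List[Dict[str, Union[int, float]]]:
    """Sort candidate indices by descending score (hypothetical helper)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    return [{"corpus_id": i, "score": float(scores[i])} for i in order]


# Made-up scores standing in for the output of model.predict(pairs).
example_scores = [0.91, 0.12, 0.33, 0.05, 0.27]
ranked = rank_by_score(example_scores, top_k=3)
print(ranked)
```

In a retrieval pipeline, the candidates would typically come from a cheaper first-stage retriever, and only the top few reranked results would be kept.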

📚 Documentation

Model details

Property	Details
Model type	Cross Encoder
Base model	prajjwal1/bert-tiny
Maximum sequence length	512 tokens
Number of output labels	1 label
Language	en
License	apache-2.0

Model resources

Documentation: Sentence Transformers Documentation
Documentation: Cross Encoder Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Cross Encoders on Hugging Face

Evaluation metrics

Cross Encoder reranking

Datasets: gooaq-dev, NanoMSMARCO, NanoNFCorpus, and NanoNQ
Evaluation method: evaluated with CrossEncoderRerankingEvaluator

Metric	gooaq-dev	NanoMSMARCO	NanoNFCorpus	NanoNQ
map	0.5677 (+0.0366)	0.4280 (-0.0616)	0.3397 (+0.0787)	0.4149 (-0.0047)
mrr@10	0.5558 (+0.0318)	0.4129 (-0.0646)	0.5196 (+0.0198)	0.4132 (-0.0135)
ndcg@10	0.6157 (+0.0245)	0.4772 (-0.0632)	0.3308 (+0.0058)	0.4859 (-0.0147)
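
For readers unfamiliar with the three metrics in the table, here is a small pure-Python sketch that computes them for a single query, given binary relevance labels listed in ranked order. The function names are hypothetical, and the evaluator's exact conventions (for example, how it treats relevant documents missing from the candidate list) may differ; reported values are averages over all queries.

```python
import math
from typing import List


def average_precision(rels: List[int]) -> float:
    """Mean of precision@rank over the ranks that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank_at_k(rels: List[int], k: int = 10) -> float:
    """1 / rank of the first relevant item within the top k, else 0."""
    for rank, rel in enumerate(rels[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(rels: List[int], k: int = 10) -> float:
    """Discounted cumulative gain at k, normalised by the ideal ordering."""
    def dcg(values: List[int]) -> float:
        return sum(v / math.log2(i + 1) for i, v in enumerate(values, start=1))

    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal else 0.0


# Made-up ranking: relevant items sit at ranks 2 and 4.
ranked_labels = [0, 1, 0, 1, 0]
ap = average_precision(ranked_labels)
rr = reciprocal_rank_at_k(ranked_labels)
ndcg = ndcg_at_k(ranked_labels)
print(f"AP={ap:.4f} RR@10={rr:.4f} NDCG@10={ndcg:.4f}")
```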

Cross Encoder Nano BEIR

Dataset: NanoBEIR_R100_mean
Evaluation method: evaluated with CrossEncoderNanoBEIREvaluator

Metric	Value
map	0.3942 (+0.0041)
mrr@10	0.4486 (-0.0194)
ndcg@10	0.4313 (-0.0241)
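
The NanoBEIR_R100_mean values appear to be the plain arithmetic mean of the three Nano datasets in the reranking table above. The short check below copies those per-dataset scores and averages them; it is only a sanity check on the published numbers, not the evaluator's code.

```python
# Per-dataset scores copied from the Cross Encoder reranking table.
nano_results = {
    "map": {"NanoMSMARCO": 0.4280, "NanoNFCorpus": 0.3397, "NanoNQ": 0.4149},
    "mrr@10": {"NanoMSMARCO": 0.4129, "NanoNFCorpus": 0.5196, "NanoNQ": 0.4132},
    "ndcg@10": {"NanoMSMARCO": 0.4772, "NanoNFCorpus": 0.3308, "NanoNQ": 0.4859},
}

# Unweighted mean across the three datasets for each metric.
means = {
    metric: sum(per_dataset.values()) / len(per_dataset)
    for metric, per_dataset in nano_results.items()
}

for metric, value in means.items():
    print(f"{metric}: {value:.4f}")
```

Rounded to four decimals, the results agree with the map, mrr@10, and ndcg@10 values in the table.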

Training details

Training dataset

Unnamed dataset
- Size: 578,402 training samples
- Columns: question, answer, and label
- Approximate statistics based on the first 1000 samples:

	question	answer	label
Type	string	string	int
Details	min: 21 characters; mean: 43.81 characters; max: 96 characters	min: 51 characters; mean: 252.46 characters; max: 405 characters	0: ~82.90%; 1: ~17.10%

- Samples:

question	answer	label
are javascript developers in demand?	JavaScript is the skill that is most in-demand for IT in 2020, according to a report from developer skills tester DevSkiller. The report, “Top IT Skills report 2020: Demand and Hiring Trends,” has JavaScript switching places with Java when compared to last year's report, with Java in third place this year, behind SQL.	1
are javascript developers in demand?	In one line difference between the two is: JavaScript is the programming language where as AngularJS is a framework based on JavaScript. ... It is also the basic for all java script based technologies like jquery, angular JS, bootstrap JS and so on. Angular JS is a framework written in javascript and uses MVC architecture.	0
are javascript developers in demand?	Java applications are run in a virtual machine or web browser while JavaScript is run on a web browser. Java code is compiled whereas while JavaScript code is in text and in a web page. JavaScript is an OOP scripting language, whereas Java is an OOP programming language.	0

Loss: BinaryCrossEntropyLoss with the following parameters:

{
    "activation_fct": "torch.nn.modules.linear.Identity",
    "pos_weight": 5
}
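
As a rough illustration of these two parameters: an Identity activation means the raw logit goes straight into the loss, and `pos_weight` scales the loss term for positive pairs. The sketch below writes out the per-example formula in pure Python, assuming the loss behaves like PyTorch's `BCEWithLogitsLoss` with `pos_weight`; the function name is hypothetical.

```python
import math


def weighted_bce_with_logits(logit: float, label: int, pos_weight: float = 5.0) -> float:
    """Per-example binary cross-entropy on a raw logit, with positives up-weighted."""
    # log(sigmoid(x)), written so that exp() never overflows.
    if logit >= 0:
        log_sig = -math.log1p(math.exp(-logit))
    else:
        log_sig = logit - math.log1p(math.exp(logit))
    # log(1 - sigmoid(x)) = log(sigmoid(x)) - x
    log_one_minus_sig = log_sig - logit
    return -(pos_weight * label * log_sig + (1 - label) * log_one_minus_sig)


# At logit 0 the model is maximally unsure: a positive costs 5x a negative.
loss_pos = weighted_bce_with_logits(0.0, 1)
loss_neg = weighted_bce_with_logits(0.0, 0)
print(loss_pos, loss_neg)

# Negative-to-positive ratio from the label statistics (~82.90% vs ~17.10%).
neg_to_pos = 82.90 / 17.10
print(round(neg_to_pos, 2))
```

The label statistics give a negative-to-positive ratio of about 4.85, so a `pos_weight` of 5 roughly balances the total contribution of the two classes.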

Training hyperparameters

Non-default hyperparameters

eval_strategy: steps
per_device_train_batch_size: 2048
per_device_eval_batch_size: 2048
learning_rate: 0.0005
num_train_epochs: 1
warmup_ratio: 0.1
seed: 12
bf16: True
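
These settings determine the length of the run. The sketch below works through the arithmetic for a single device with no gradient accumulation, and models the linear schedule with warmup. Rounding the warmup length up is an assumption about how the trainer converts `warmup_ratio` into steps, and `linear_lr` is a hypothetical helper, not a library function.

```python
import math

num_samples = 578_402   # training set size
batch_size = 2048       # per_device_train_batch_size, one GPU
epochs = 1              # num_train_epochs
warmup_ratio = 0.1
peak_lr = 5e-4          # learning_rate

# The last partial batch is kept (dataloader_drop_last: False), so round up.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs
# Assumption: the warmup length is rounded up to a whole step.
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_lr(step: int) -> float:
    """Linear warmup to peak_lr, then linear decay to zero (hypothetical helper)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


print(steps_per_epoch, warmup_steps)
print(round(280 / total_steps, 4))  # epoch fraction reached at step 280
```

The resulting 283 steps per epoch agree with the training log, where step 280 corresponds to epoch 0.9894.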

All hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 2048
per_device_eval_batch_size: 2048
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 0.0005
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 12
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: True
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: proportional

Training logs

Epoch	Step	Training Loss	gooaq-dev_ndcg@10	NanoMSMARCO_ndcg@10	NanoNFCorpus_ndcg@10	NanoNQ_ndcg@10	NanoBEIR_R100_mean_ndcg@10
-1	-1	-	0.0887 (-0.5025)	0.0063 (-0.5341)	0.3262 (+0.0012)	0.0000 (-0.5006)	0.1108 (-0.3445)
0.0035	1	1.1945	-	-	-	-	-
0.0707	20	1.1664	0.4082 (-0.1830)	0.1805 (-0.3600)	0.3168 (-0.0083)	0.2243 (-0.2763)	0.2405 (-0.2149)
0.1413	40	1.1107	0.5260 (-0.0652)	0.3453 (-0.1951)	0.3335 (+0.0085)	0.3430 (-0.1576)	0.3406 (-0.1147)
0.2120	60	1.022	0.5623 (-0.0289)	0.3929 (-0.1475)	0.3512 (+0.0262)	0.3472 (-0.1535)	0.3638 (-0.0916)
0.2827	80	0.973	0.5691 (-0.0221)	0.4048 (-0.1356)	0.3530 (+0.0280)	0.3833 (-0.1174)	0.3804 (-0.0750)
0.3534	100	0.963	0.5814 (-0.0098)	0.4385 (-0.1019)	0.3471 (+0.0221)	0.4227 (-0.0779)	0.4028 (-0.0526)
0.4240	120	0.9419	0.5963 (+0.0050)	0.4106 (-0.1298)	0.3540 (+0.0289)	0.4843 (-0.0163)	0.4163 (-0.0391)
0.4947	140	0.9331	0.5953 (+0.0041)	0.4310 (-0.1094)	0.3367 (+0.0117)	0.4163 (-0.0843)	0.3947 (-0.0607)
0.5654	160	0.9263	0.6070 (+0.0158)	0.4626 (-0.0778)	0.3443 (+0.0193)	0.4823 (-0.0184)	0.4297 (-0.0256)
0.6360	180	0.9212	0.6069 (+0.0156)	0.4602 (-0.0802)	0.3391 (+0.0141)	0.4782 (-0.0224)	0.4258 (-0.0295)
0.7067	200	0.901	0.6126 (+0.0214)	0.4602 (-0.0803)	0.3413 (+0.0162)	0.4780 (-0.0227)	0.4265 (-0.0289)
0.7774	220	0.8997	0.6136 (+0.0224)	0.4801 (-0.0604)	0.3349 (+0.0098)	0.4903 (-0.0103)	0.4351 (-0.0203)
0.8481	240	0.9021	0.6132 (+0.0220)	0.4850 (-0.0554)	0.3438 (+0.0188)	0.4855 (-0.0151)	0.4381 (-0.0173)
0.9187	260	0.9013	0.6188 (+0.0276)	0.4820 (-0.0584)	0.3387 (+0.0137)	0.4851 (-0.0156)	0.4353 (-0.0201)
0.9894	280	0.8996	0.6157 (+0.0245)	0.4772 (-0.0632)	0.3305 (+0.0054)	0.4859 (-0.0147)	0.4312 (-0.0242)
-1	-1	-	0.6157 (+0.0245)	0.4772 (-0.0632)	0.3308 (+0.0058)	0.4859 (-0.0147)	0.4313 (-0.0241)
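
Each metric cell in the log pairs a score with a signed number in parentheses. Assuming the parenthesised number is the change relative to the candidate ranking before the cross-encoder reorders it, the baseline can be recovered by subtraction. The sketch below parses two gooaq-dev cells from the first and last rows; `parse_cell` is a hypothetical helper.

```python
import re
from typing import Tuple

# Matches cells such as "0.6157 (+0.0245)".
CELL = re.compile(r"^\s*(-?\d+\.\d+)\s*\(([+-]\d+\.\d+)\)\s*$")


def parse_cell(cell: str) -> Tuple[float, float]:
    """Split a log cell into (score, delta)."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    return float(match.group(1)), float(match.group(2))


# gooaq-dev_ndcg@10 before training and at the end of training.
cells = ["0.0887 (-0.5025)", "0.6157 (+0.0245)"]
baselines = []
for cell in cells:
    score, delta = parse_cell(cell)
    baselines.append(round(score - delta, 4))
    print(cell, "-> implied baseline", baselines[-1])
```

Both rows imply the same baseline, which is consistent with the assumption that the deltas are measured against a fixed pre-reranking order.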

Environmental impact

Carbon emissions were measured using CodeCarbon:

Energy consumed: 0.019 kWh
Carbon emitted: 0.007 kg of CO2
Hours used: 0.099 hours
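
To put these figures in more familiar units, the arithmetic below derives the run time in minutes, the average power draw, and the implied carbon intensity. The inputs are the rounded values reported above, so the results are approximate.

```python
energy_kwh = 0.019   # reported energy consumed
co2_kg = 0.007       # reported carbon emitted
hours = 0.099        # reported duration

minutes = hours * 60
avg_power_watts = energy_kwh / hours * 1000
carbon_intensity = co2_kg / energy_kwh  # kg of CO2 per kWh

print(f"duration: {minutes:.1f} min")
print(f"average power: {avg_power_watts:.0f} W")
print(f"carbon intensity: {carbon_intensity:.2f} kg CO2/kWh")
```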

Training hardware

On cloud: No
GPU model: 1 x NVIDIA GeForce RTX 3090
CPU model: 13th Gen Intel(R) Core(TM) i7-13700K
RAM size: 31.78 GB

Framework versions

Python: 3.11.6
Sentence Transformers: 3.5.0.dev0
Transformers: 4.48.3
PyTorch: 2.5.0+cu121
Accelerate: 1.3.0
Datasets: 2.20.0
Tokenizers: 0.21.0

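🔧 Technical details

The model uses the Cross Encoder architecture and was fine-tuned with BinaryCrossEntropyLoss. Training on labeled question-answer pairs teaches it to score how relevant the second text of a pair is to the first. The run was controlled by hyperparameters such as the learning rate (0.0005) and the per-device batch size (2048).

📄 License

This model is released under the apache-2.0 license.

📖 Citation

BibTeX

Sentence Transformers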

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}