reranker-bert-tiny-gooaq-bce open-source model: computes text similarity for semantic search and related tasks


Reranker Bert Tiny Gooaq Bce

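Developed by cross-encoder-testing

This is a cross-encoder model fine-tuned from bert-tiny. It computes similarity scores for pairs of texts and is suited to tasks such as semantic textual similarity and semantic search.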

Text Embedding

Safetensors

Language: English · License: Apache-2.0 #QA-reranking #lightweight-BERT #semantic-matching

Downloads: 37.19k

Released: 2/26/2025

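Model Overview

The model is based on the BERT-tiny architecture and was developed with the sentence-transformers library. It computes similarity scores for text pairs and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.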

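Model Features

Efficient and lightweight

Built on the BERT-tiny architecture, so the model is small and inexpensive to run

Applicable to multiple tasks

Can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, and other tasks

High performance

Reaches a MAP of 0.5677 on the gooaq-dev dataset; results on NanoMSMARCO, NanoNFCorpus, and NanoNQ are listed in the evaluation tables later on this page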

Model Capabilities

Text similarity scoring

Semantic search

Text classification

Text clustering

Paraphrase mining

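Use Cases

Information retrieval

Answer ranking for question answering systems

Ranks candidate answers by relevance to improve the quality of a question answering system

Reaches a MAP of 0.5677 on the gooaq-dev dataset

Content recommendation

Related content recommendation

Recommends related content based on the user's query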

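🚀 BERT-tiny trained on GooAQ

This is a Cross Encoder model fine-tuned from prajjwal1/bert-tiny using the sentence-transformers library. It computes scores for pairs of texts, which can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

The model was trained with train_script.py.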

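🚀 Quick Start

This model is a fine-tuned Cross Encoder that computes scores for text pairs and suits tasks such as semantic textual similarity and semantic search. Follow the steps below to use it.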

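✨ Key Features

Cross Encoder model: built on the Cross Encoder architecture, which scores a pair of texts directly.
Fine-tuned from a pre-trained model: fine-tuned from prajjwal1/bert-tiny, building on the strengths of the pre-trained model.
Applicable to multiple tasks: can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.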

📦 Installation

First, install the Sentence Transformers library:

pip install -U sentence-transformers

💻 Usage Examples

Basic usage

The following example computes scores for text pairs with this model:

from sentence_transformers import CrossEncoder

# Download the model from the 🤗 Hub
model = CrossEncoder("cross-encoder-testing/reranker-bert-tiny-gooaq-bce")
# Define the text pairs
pairs = [
    ['are javascript developers in demand?', "JavaScript is the skill that is most in-demand for IT in 2020, according to a report from developer skills tester DevSkiller. The report, “Top IT Skills report 2020: Demand and Hiring Trends,” has JavaScript switching places with Java when compared to last year's report, with Java in third place this year, behind SQL."],
    ['are javascript developers in demand?', 'In one line difference between the two is: JavaScript is the programming language where as AngularJS is a framework based on JavaScript. ... It is also the basic for all java script based technologies like jquery, angular JS, bootstrap JS and so on. Angular JS is a framework written in javascript and uses MVC architecture.'],
    ['are javascript developers in demand?', 'Java applications are run in a virtual machine or web browser while JavaScript is run on a web browser. Java code is compiled whereas while JavaScript code is in text and in a web page. JavaScript is an OOP scripting language, whereas Java is an OOP programming language.'],
    ['are javascript developers in demand?', 'Things in the body tag are the things that should be displayed: the actual content. Javascript in the body is executed as it is read and as the page is rendered. Javascript in the head is interpreted before anything is rendered.'],
    ['are javascript developers in demand?', 'Web apps tend to be built using JavaScript, CSS and HTML5. Unlike mobile apps, there is no standard software development kit for building web apps. However, developers do have access to templates. Compared to mobile apps, web apps are usually quicker and easier to build — but they are much simpler in terms of features.'],
]
# Predict the scores
scores = model.predict(pairs)
print(scores.shape)
# (5,)

# Or rank different texts by their similarity to a single text
ranks = model.rank(
    'are javascript developers in demand?',
    [
        "JavaScript is the skill that is most in-demand for IT in 2020, according to a report from developer skills tester DevSkiller. The report, “Top IT Skills report 2020: Demand and Hiring Trends,” has JavaScript switching places with Java when compared to last year's report, with Java in third place this year, behind SQL.",
        'In one line difference between the two is: JavaScript is the programming language where as AngularJS is a framework based on JavaScript. ... It is also the basic for all java script based technologies like jquery, angular JS, bootstrap JS and so on. Angular JS is a framework written in javascript and uses MVC architecture.',
        'Java applications are run in a virtual machine or web browser while JavaScript is run on a web browser. Java code is compiled whereas while JavaScript code is in text and in a web page. JavaScript is an OOP scripting language, whereas Java is an OOP programming language.',
        'Things in the body tag are the things that should be displayed: the actual content. Javascript in the body is executed as it is read and as the page is rendered. Javascript in the head is interpreted before anything is rendered.',
        'Web apps tend to be built using JavaScript, CSS and HTML5. Unlike mobile apps, there is no standard software development kit for building web apps. However, developers do have access to templates. Compared to mobile apps, web apps are usually quicker and easier to build — but they are much simpler in terms of features.',
    ]
)
# [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
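As the comment above indicates, `rank` returns a list of `corpus_id`/`score` entries. The same ordering can be derived from raw `predict` scores by sorting them in descending order. The sketch below does this with NumPy only, so it runs without downloading the model. The helper name `rank_by_score`, its `top_k` parameter, and the example scores are hypothetical and serve only as an illustration.

```python
import numpy as np

def rank_by_score(scores, top_k=None):
    """Order candidate indices by descending score (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float)
    # Negate so that argsort yields the highest score first; a stable sort
    # keeps the input order for tied scores.
    order = np.argsort(-scores, kind="stable")
    if top_k is not None:
        order = order[:top_k]
    return [{"corpus_id": int(i), "score": float(scores[i])} for i in order]

# Made-up scores standing in for the output of model.predict(pairs)
example_scores = [2.5, -1.0, 0.3, -3.2, 0.3]
ranking = rank_by_score(example_scores, top_k=3)
print(ranking)
```

Keeping only the top few entries is a common way to pass a short, reranked candidate list on to a later stage of a retrieval pipeline.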

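📚 Detailed Documentation

Model Details

Property	Details
Model type	Cross Encoder
Base model	prajjwal1/bert-tiny
Maximum sequence length	512 tokens
Number of output labels	1 label
Language	English
License	apache-2.0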

Model Resources

Documentation: Sentence Transformers Documentation
Documentation: Cross Encoder Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Cross Encoders on Hugging Face

Evaluation Metrics

Cross Encoder Reranking

Datasets: gooaq-dev, NanoMSMARCO, NanoNFCorpus, and NanoNQ
Evaluated with CrossEncoderRerankingEvaluator

Metric	gooaq-dev	NanoMSMARCO	NanoNFCorpus	NanoNQ
map	0.5677 (+0.0366)	0.4280 (-0.0616)	0.3397 (+0.0787)	0.4149 (-0.0047)
mrr@10	0.5558 (+0.0318)	0.4129 (-0.0646)	0.5196 (+0.0198)	0.4132 (-0.0135)
ndcg@10	0.6157 (+0.0245)	0.4772 (-0.0632)	0.3308 (+0.0058)	0.4859 (-0.0147)
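For readers unfamiliar with the three metrics in this table, the sketch below gives textbook per-query definitions of average precision, reciprocal rank at 10, and NDCG at 10 for binary relevance labels. It is an illustration only: the function names are hypothetical, and the evaluator's exact handling of edge cases (for example, relevant documents missing from the candidate list) may differ. The reported map, mrr@10, and ndcg@10 are averages of such per-query values.

```python
import math

def average_precision(relevance):
    """AP for one query; `relevance` holds 0/1 labels in ranked order."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at each relevant position
    return total / hits if hits else 0.0

def reciprocal_rank_at_k(relevance, k=10):
    """1 / rank of the first relevant item within the top k, else 0."""
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevance, k=10):
    """DCG of the ranking divided by the DCG of an ideal ordering."""
    def dcg(labels):
        return sum(rel / math.log2(rank + 1)
                   for rank, rel in enumerate(labels, start=1))
    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal else 0.0

# Hypothetical ranked list: relevant answers at positions 2 and 4
ranked_labels = [0, 1, 0, 1, 0]
print(average_precision(ranked_labels))
print(reciprocal_rank_at_k(ranked_labels))
print(ndcg_at_k(ranked_labels))
```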

Cross Encoder Nano BEIR

Dataset: NanoBEIR_R100_mean
Evaluated with CrossEncoderNanoBEIREvaluator

Metric	Value
map	0.3942 (+0.0041)
mrr@10	0.4486 (-0.0194)
ndcg@10	0.4313 (-0.0241)
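The NanoBEIR_R100_mean values appear to be the unweighted mean of the three Nano datasets in the reranking table (NanoMSMARCO, NanoNFCorpus, NanoNQ). The following arithmetic check, using only numbers copied from that table, reproduces the three values shown here after rounding to four decimals. Treating the mean as unweighted is an assumption that this check supports but the page does not state.

```python
# Per-dataset values copied from the Cross Encoder reranking table
per_dataset = {
    "map":     {"NanoMSMARCO": 0.4280, "NanoNFCorpus": 0.3397, "NanoNQ": 0.4149},
    "mrr@10":  {"NanoMSMARCO": 0.4129, "NanoNFCorpus": 0.5196, "NanoNQ": 0.4132},
    "ndcg@10": {"NanoMSMARCO": 0.4772, "NanoNFCorpus": 0.3308, "NanoNQ": 0.4859},
}

# Unweighted mean across the three datasets, rounded like the table
means = {
    metric: round(sum(values.values()) / len(values), 4)
    for metric, values in per_dataset.items()
}
print(means)
```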

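Training Details

Training Dataset

Unnamed dataset

Size: 578,402 training samples
Columns: question, answer, and label

Approximate statistics based on the first 1000 samples:

	question	answer	label
Type	string	string	int
Details	min: 21 characters; mean: 43.81 characters; max: 96 characters	min: 51 characters; mean: 252.46 characters; max: 405 characters	0: ~82.90%; 1: ~17.10%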

Samples:

question	answer	label
`are javascript developers in demand?`	`JavaScript is the skill that is most in-demand for IT in 2020, according to a report from developer skills tester DevSkiller. The report, “Top IT Skills report 2020: Demand and Hiring Trends,” has JavaScript switching places with Java when compared to last year's report, with Java in third place this year, behind SQL.`	`1`
`are javascript developers in demand?`	`In one line difference between the two is: JavaScript is the programming language where as AngularJS is a framework based on JavaScript. ... It is also the basic for all java script based technologies like jquery, angular JS, bootstrap JS and so on. Angular JS is a framework written in javascript and uses MVC architecture.`	`0`
`are javascript developers in demand?`	`Java applications are run in a virtual machine or web browser while JavaScript is run on a web browser. Java code is compiled whereas while JavaScript code is in text and in a web page. JavaScript is an OOP scripting language, whereas Java is an OOP programming language.`	`0`

Loss: BinaryCrossEntropyLoss with these parameters:

{
    "activation_fct": "torch.nn.modules.linear.Identity",
    "pos_weight": 5
}
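To show what `pos_weight` does, here is a pure-Python sketch of binary cross-entropy on a raw logit with the positive class up-weighted. It assumes the loss follows the standard BCE-with-logits formulation; the function names are hypothetical and this is not the library's implementation. In the sampled label statistics, negatives outnumber positives by roughly 82.90 / 17.10 ≈ 4.85 to 1, so a weight of 5 plausibly serves to balance the two classes.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def weighted_bce_with_logits(logit, label, pos_weight=5.0):
    """Binary cross-entropy on a raw logit, with positives up-weighted."""
    log_p = -softplus(-logit)      # log(sigmoid(logit))
    log_not_p = -softplus(logit)   # log(1 - sigmoid(logit))
    return -(pos_weight * label * log_p + (1 - label) * log_not_p)

# At logit 0 the model is undecided: a missed positive costs
# pos_weight times as much as a missed negative.
print(weighted_bce_with_logits(0.0, 1))
print(weighted_bce_with_logits(0.0, 0))
```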

Training Hyperparameters

Non-default hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 2048
- per_device_eval_batch_size: 2048
- learning_rate: 0.0005
- num_train_epochs: 1
- warmup_ratio: 0.1
- seed: 12
- bf16: True
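The numbers above determine how many optimizer steps one epoch takes. As a sketch, assuming a single GPU with no gradient accumulation (consistent with the hardware and hyperparameters listed on this page), 578,402 samples at a batch size of 2048 give 283 steps, which agrees with the training log listing step 280 at epoch 0.9894. The schedule function below is the usual linear warmup followed by linear decay; its exact shape and the rounding of the warmup step count are assumptions, not taken from the training script.

```python
import math

NUM_SAMPLES = 578_402   # training set size reported on this page
BATCH_SIZE = 2048       # per_device_train_batch_size
BASE_LR = 0.0005        # learning_rate
WARMUP_RATIO = 0.1      # warmup_ratio

total_steps = math.ceil(NUM_SAMPLES / BATCH_SIZE)       # steps in one epoch
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)    # assumed rounding

def linear_schedule_lr(step):
    """Linear warmup to BASE_LR, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return BASE_LR * max(0.0, remaining)

print(total_steps, warmup_steps)
print(round(280 / total_steps, 4))   # epoch fraction reached at step 280
```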
All hyperparameters:

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 2048
per_device_eval_batch_size: 2048
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 0.0005
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 12
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: True
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	Training Loss	gooaq-dev_ndcg@10	NanoMSMARCO_ndcg@10	NanoNFCorpus_ndcg@10	NanoNQ_ndcg@10	NanoBEIR_R100_mean_ndcg@10
-1	-1	-	0.0887 (-0.5025)	0.0063 (-0.5341)	0.3262 (+0.0012)	0.0000 (-0.5006)	0.1108 (-0.3445)
0.0035	1	1.1945	-	-	-	-	-
0.0707	20	1.1664	0.4082 (-0.1830)	0.1805 (-0.3600)	0.3168 (-0.0083)	0.2243 (-0.2763)	0.2405 (-0.2149)
0.1413	40	1.1107	0.5260 (-0.0652)	0.3453 (-0.1951)	0.3335 (+0.0085)	0.3430 (-0.1576)	0.3406 (-0.1147)
0.2120	60	1.022	0.5623 (-0.0289)	0.3929 (-0.1475)	0.3512 (+0.0262)	0.3472 (-0.1535)	0.3638 (-0.0916)
0.2827	80	0.973	0.5691 (-0.0221)	0.4048 (-0.1356)	0.3530 (+0.0280)	0.3833 (-0.1174)	0.3804 (-0.0750)
0.3534	100	0.963	0.5814 (-0.0098)	0.4385 (-0.1019)	0.3471 (+0.0221)	0.4227 (-0.0779)	0.4028 (-0.0526)
0.4240	120	0.9419	0.5963 (+0.0050)	0.4106 (-0.1298)	0.3540 (+0.0289)	0.4843 (-0.0163)	0.4163 (-0.0391)
0.4947	140	0.9331	0.5953 (+0.0041)	0.4310 (-0.1094)	0.3367 (+0.0117)	0.4163 (-0.0843)	0.3947 (-0.0607)
0.5654	160	0.9263	0.6070 (+0.0158)	0.4626 (-0.0778)	0.3443 (+0.0193)	0.4823 (-0.0184)	0.4297 (-0.0256)
0.6360	180	0.9212	0.6069 (+0.0156)	0.4602 (-0.0802)	0.3391 (+0.0141)	0.4782 (-0.0224)	0.4258 (-0.0295)
0.7067	200	0.901	0.6126 (+0.0214)	0.4602 (-0.0803)	0.3413 (+0.0162)	0.4780 (-0.0227)	0.4265 (-0.0289)
0.7774	220	0.8997	0.6136 (+0.0224)	0.4801 (-0.0604)	0.3349 (+0.0098)	0.4903 (-0.0103)	0.4351 (-0.0203)
0.8481	240	0.9021	0.6132 (+0.0220)	0.4850 (-0.0554)	0.3438 (+0.0188)	0.4855 (-0.0151)	0.4381 (-0.0173)
0.9187	260	0.9013	0.6188 (+0.0276)	0.4820 (-0.0584)	0.3387 (+0.0137)	0.4851 (-0.0156)	0.4353 (-0.0201)
0.9894	280	0.8996	0.6157 (+0.0245)	0.4772 (-0.0632)	0.3305 (+0.0054)	0.4859 (-0.0147)	0.4312 (-0.0242)
-1	-1	-	0.6157 (+0.0245)	0.4772 (-0.0632)	0.3308 (+0.0058)	0.4859 (-0.0147)	0.4313 (-0.0241)

Environmental Impact

Carbon emissions were measured with CodeCarbon:

Energy consumed: 0.019 kWh
Carbon emitted: 0.007 kg CO2
Hours used: 0.099 hours
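The three figures can be put into more familiar units with simple arithmetic. The sketch below derives the run time in minutes, the average power draw, and the implied carbon intensity. Because the inputs are rounded, the results are approximate, and the power figure covers whatever components CodeCarbon tracked, which this page does not break down.

```python
energy_kwh = 0.019   # energy consumed
co2_kg = 0.007       # carbon emitted
hours = 0.099        # hours used

minutes = hours * 60                       # run time in minutes
avg_power_w = energy_kwh / hours * 1000    # average power in watts
intensity = co2_kg / energy_kwh            # kg CO2 per kWh

print(f"{minutes:.1f} min, {avg_power_w:.0f} W, {intensity:.2f} kg CO2/kWh")
```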

Training Hardware

On cloud: No
GPU model: 1 x NVIDIA GeForce RTX 3090
CPU model: 13th Gen Intel(R) Core(TM) i7-13700K
RAM size: 31.78 GB

Framework Versions

Python: 3.11.6
Sentence Transformers: 3.5.0.dev0
Transformers: 4.48.3
PyTorch: 2.5.0+cu121
Accelerate: 1.3.0
Datasets: 2.20.0
Tokenizers: 0.21.0

📄 License

This project is licensed under the apache-2.0 license.

📖 Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}