Splade_PP_en_v2 open-source model: built for industry settings, balancing retrieval quality and efficiency with document-expansion learning

Splade PP En V2

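Developed by prithivida

An implementation of the SPLADE++ model optimized for industry settings. It balances retrieval quality and efficiency and supports document expansion and sparse representation learning.

Text embedding

Transformers

Language: English · License: Apache-2.0 · Tags: #industry-grade sparse retrieval #FLOPS optimization #document expansion

Downloads: 181

Released: 3/13/2024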

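Model overview

An independent implementation of SPLADE++ focused on retrieval efficiency and cost in industry settings, combining the strengths of lexical and semantic search

Model features

Industry-grade efficiency

Keeps a strict FLOPS and token budget for documents (128) and queries (24), with a reported retrieval latency of 48.81 ms

Sparse representation learning

Combines the interpretability of lexical search with the generalization of semantic search, and expands query terms automatically

Two-model strategy

Separates the document and query models to optimize latency; the query model will be released soon

Strong domain adaptability

Shows that the model can run in a single-CPU environment, supporting low-cost domain customization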

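Model capabilities

Sparse document encoding

Query expansion

Passage retrieval

Knowledge distillation

Cross-domain zero-shot retrieval

Use cases

Search engine optimization

Enterprise document retrieval

Efficient document retrieval with limited compute resources

MRR@10 of 37.8 on in-domain (ID) data

Knowledge management

Technical documentation retrieval

Handles vocabulary mismatch for specialized terminology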

Score of 49.4 on out-of-distribution (OOD) data

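🚀 Independent implementation of the SPLADE++ model (for industry settings)

This project is an independent implementation of the SPLADE++ model with some efficiency tweaks for industry settings. It combines the strengths of two lines of research to offer efficient and effective sparse-representation retrieval.

🚀 Quick start

This work draws on two important pieces of research: Naver's "From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective" and Google's "SparseEmbed". Thanks to both teams for their excellent work.

This is the second iteration in the series. You can try V1 here: prithivida/Splade_PP_en_v1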

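✨ Key features

1. What sparse representations are and why to learn one

1. Lexical search: bag-of-words (BOW) sparse vectors are a strong baseline, but they suffer from vocabulary mismatch because they can only match terms exactly. Pros and cons:

✅ Efficient and cheap.
✅ No model fine-tuning needed.
✅ Interpretable.
✅ Exact term matches.
❌ Vocabulary mismatch (you need to remember the exact terms).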

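2. Semantic search: learned neural/dense retrievers (such as DPR, Sentence transformers*, and BGE* models) combined with approximate nearest neighbor search have shown impressive results. Pros and cons:

✅ Searches the way humans naturally think.
✅ When fine-tuned, beats sparse retrieval by a wide margin.
✅ Works easily with multiple modalities.
❌ Suffers from token amnesia (misses term matches).
❌ Resource intensive (at both index and retrieval time).
❌ Hard to interpret.
❌ Needs fine-tuning for out-of-distribution (OOD) data.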

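3. The big idea: getting the pros of both kinds of search gave rise to interest in learning sparse representations for queries and documents that retain some interpretability. Sparse representations also double as implicit or explicit (latent, contextualized) expansion mechanisms for both queries and documents. If you are new to query expansion, you can learn more from Daniel Tunkelang.

4. What a sparse model learns: the model learns to project its learned dense representations onto the MLM head to produce a distribution over the vocabulary. This means the model can perform automatic token expansion. (Image credit: Pinecone)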
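The projection described in point 4 can be sketched without any deep-learning library. The snippet below is a minimal illustration, assuming the same pooling used in the full HuggingFace script later on this page: log(1 + ReLU(logit)) per token, masked, then max-pooled over token positions. The toy vocabulary, the logits, and the name `splade_pool` are made up for the example.

```python
import math

def splade_pool(logits, attention_mask):
    """Collapse per-token MLM logits into one weight per vocabulary entry.

    logits: list of rows, one per input token, each row holding one score
            per vocabulary entry.
    attention_mask: list of 0/1 flags, one per input token.
    """
    vocab_size = len(logits[0])
    weights = [0.0] * vocab_size
    for row, keep in zip(logits, attention_mask):
        if not keep:
            continue  # padding positions contribute nothing
        for j, score in enumerate(row):
            # log(1 + ReLU(x)) zeroes negative logits and dampens large ones
            w = math.log1p(max(score, 0.0))
            weights[j] = max(weights[j], w)  # max-pool over token positions
    return weights

# Hypothetical four-word vocabulary and invented logits:
# two real input tokens plus one padding position.
vocab = ["glass", "crack", "heat", "banana"]
logits = [
    [3.0, 1.0, -2.0, -5.0],  # first input token
    [0.5, -1.0, 2.0, -4.0],  # second input token
    [9.0, 9.0, 9.0, 9.0],    # padding, masked out below
]
weights = splade_pool(logits, [1, 1, 0])
sparse_rep = {t: round(w, 2) for t, w in zip(vocab, weights) if w > 0}
print(sparse_rep)  # {'glass': 1.39, 'crack': 0.69, 'heat': 1.1}
```

Terms such as "crack" and "heat" receive weight even if they never appear in the input, which is the automatic expansion; "banana" stays at zero and drops out of the sparse representation.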

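2. Motivation

SPLADE models strike a fine balance between retrieval effectiveness (quality) and retrieval efficiency (latency and cost). With that in mind, we made minor retrieval-efficiency tweaks to make the model more suitable for industry settings.

Summary of what we tried and the results

FLOPS tuning: separate sequence lengths and a severely restricted FLOPS schedule and token budget, 128 for documents and 24 for queries, rather than the 256 used by the official SPLADE++. Inspired by SparseEmbed.
Initial weights: a bert-base-uncased model mid-trained with an MLM loss, which gives it some corpus awareness, similar to the official SPLADE++ or ColBERT.
On a consumer-grade GPU with only 5 negatives per query, the model reaches a competitive MRR@10 of 37.8 on in-domain (ID) data (49.4 on OOD), with a retrieval latency of 48.81 ms (multi-threaded).
For industry settings, effectiveness on a custom domain takes more than "trading FLOPS for tiny gains", and the premise that "SPLADE++ is not well suited to mono-CPU retrieval" does not hold.
Because of query-time inference latency, we still need two models, one for queries and one for documents. This is the document model; the query model will be released soon.

Note: the paper refers to its best-performing models as SPLADE++, so we reuse the same name for consistency.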
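For readers unfamiliar with the metric quoted above, MRR@10 is the mean, over queries, of the reciprocal rank of the first relevant passage within the top 10 results (zero if none appears). The sketch below illustrates that definition with invented rankings; the function name `mrr_at_k` and the document ids are hypothetical, and this is not the evaluation code used for the reported numbers.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=10):
    """Mean reciprocal rank of the first relevant hit within the top k."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)

# Three toy queries: relevant hit at rank 2, at rank 1, and not retrieved.
rankings = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
relevant = [{"d1"}, {"d2"}, {"d0"}]
score = mrr_at_k(rankings, relevant)
print(f"MRR@10 = {100 * score:.1f}")  # MRR@10 = 50.0
```

The figures on this page are on the same 0 to 100 scale, so 37.8 corresponds to a mean reciprocal rank of 0.378.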

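3. Why FLOPS is one of the key metrics for industry settings

Only an empirical analysis over a large sample is truly meaningful, but here is a simple spot check to give you a rough idea. Our model reaches competitive effectiveness with about 4% fewer tokens than naver/splade-cocondenser-ensembledistil (the current state of the art) and about 48% fewer than naver/splade-v2-distil. (Quantitative results appear in the next section.)

So our goal is not "how to beat the state-of-the-art MRR" but "at what cost can we reach an acceptable effectiveness, i.e. MRR@10". Casually lowering the lambda values (λQ, λD, see the table above) will improve MRR, but lower lambda values mean higher FLOPS, more tokens, and poorer efficiency, which is undesirable in industry settings.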
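To make the role of λQ and λD concrete, here is a minimal sketch of the FLOPS regularizer as it is described in the SPLADE papers: for each vocabulary term, take the mean activation over the batch, square it, and sum over the vocabulary. The training objective then adds this penalty for queries and documents, weighted by λQ and λD. The batches, the loss values, and the function names are illustrative assumptions; the author's exact training setup is not given in this page.

```python
def flops_regularizer(batch_weights):
    """Sum over the vocabulary of the squared mean activation in the batch."""
    n = len(batch_weights)
    vocab_size = len(batch_weights[0])
    total = 0.0
    for j in range(vocab_size):
        mean_activation = sum(vec[j] for vec in batch_weights) / n
        total += mean_activation ** 2
    return total

def total_loss(ranking_loss, query_weights, doc_weights, lambda_q, lambda_d):
    """Ranking loss plus lambda-weighted FLOPS penalties (illustrative)."""
    return (ranking_loss
            + lambda_q * flops_regularizer(query_weights)
            + lambda_d * flops_regularizer(doc_weights))

# Two toy batches over a two-term vocabulary with the same total weight.
shared = [[1.0, 0.0], [1.0, 0.0]]  # both items activate the same term
spread = [[1.0, 0.0], [0.0, 1.0]]  # each item activates a different term
print(flops_regularizer(shared))  # 1.0
print(flops_regularizer(spread))  # 0.5
```

Terms activated by many items are penalized more, which keeps posting lists short. Raising λQ or λD pushes the model toward fewer active tokens at some cost in ranking quality; lowering them does the opposite, which is the trade-off described above.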

Our model

number of actual dimensions:  121
SPLADE BOW rep:
 [('stress', 2.42), ('thermal', 2.31), ('glass', 2.27), ('pan', 1.78), ('heat', 1.66), ('glasses', 1.58), ('crack', 1.42), ('anxiety', 1.36), ('break', 1.31), ('window', 0.91), ('heating', 0.84), ('hot', 0.82), ('adjacent', 0.82), ('hotter', 0.82), ('if', 0.75), ('cause', 0.7), ('caused', 0.7), ('create', 0.7), ('factors', 0.69), ('created', 0.68), ('cracks', 0.67), ('breaks', 0.67), ('area', 0.66), ('##glass', 0.66), ('cracked', 0.63), ('areas', 0.6), ('cracking', 0.59), ('windows', 0.58), ('effect', 0.56), ('causes', 0.56), ('ruin', 0.54), ('severe', 0.54), ('too', 0.53), ('flame', 0.5), ('collapse', 0.49), ('stresses', 0.49), ('or', 0.48), ('physics', 0.47), ('temperature', 0.46), ('get', 0.46), ('heated', 0.45), ('problem', 0.45), ('energy', 0.44), ('hottest', 0.42), ('phenomenon', 0.42), ('sweating', 0.41), ('insulation', 0.39), ('level', 0.39), ('warm', 0.39), ('governed', 0.38), ('formation', 0.37), ('failure', 0.35), ('frank', 0.34), ('cooling', 0.32), ('fracture', 0.31), ('because', 0.31), ('crystal', 0.31), ('determined', 0.31), ('boiler', 0.31), ('mechanical', 0.3), ('shatter', 0.29), ('friction', 0.29), ('levels', 0.29), ('cold', 0.29), ('will', 0.29), ('ceramics', 0.29), ('factor', 0.28), ('crash', 0.28), ('reaction', 0.28), ('fatigue', 0.28), ('hazard', 0.27), ('##e', 0.26), ('anger', 0.26), ('bubble', 0.25), ('process', 0.24), ('cleaning', 0.23), ('surrounding', 0.22), ('theory', 0.22), ('sash', 0.22), ('distraction', 0.21), ('adjoining', 0.19), ('environmental', 0.19), ('ross', 0.18), ('formed', 0.17), ('broken', 0.16), ('affect', 0.16), ('##pan', 0.15), ('graphic', 0.14), ('damage', 0.14), ('bubbles', 0.13), ('windshield', 0.13), ('temporal', 0.13), ('roof', 0.12), ('strain', 0.12), ('clear', 0.09), ('ceramic', 0.08), ('stressed', 0.08), ('##uation', 0.08), ('cool', 0.08), ('expand', 0.07), ('storm', 0.07), ('shock', 0.07), ('psychological', 0.06), ('breaking', 0.06), ('##es', 0.06), ('melting', 0.05), ('burst', 0.05), ('sensing', 0.04), ('heats', 
0.04), ('error', 0.03), ('weather', 0.03), ('drink', 0.03), ('fire', 0.03), ('vibration', 0.02), ('induced', 0.02), ('warmer', 0.02), ('leak', 0.02), ('fog', 0.02), ('safety', 0.01), ('surface', 0.01), ('##thermal', 0.0)]

naver/splade-cocondenser-ensembledistil (state of the art, about 4% more tokens, FLOPS = 1.85)

number of actual dimensions:  126
SPLADE BOW rep:
 [('stress', 2.25), ('glass', 2.23), ('thermal', 2.18), ('glasses', 1.65), ('pan', 1.62), ('heat', 1.56), ('stressed', 1.42), ('crack', 1.31), ('break', 1.12), ('cracked', 1.1), ('hot', 0.93), ('created', 0.9), ('factors', 0.81), ('broken', 0.73), ('caused', 0.71), ('too', 0.71), ('damage', 0.69), ('if', 0.68), ('hotter', 0.65), ('governed', 0.61), ('heating', 0.59), ('temperature', 0.59), ('adjacent', 0.59), ('cause', 0.58), ('effect', 0.57), ('fracture', 0.56), ('bradford', 0.55), ('strain', 0.53), ('hammer', 0.51), ('brian', 0.48), ('error', 0.47), ('windows', 0.45), ('will', 0.45), ('reaction', 0.42), ('create', 0.42), ('windshield', 0.41), ('heated', 0.41), ('factor', 0.4), ('cracking', 0.39), ('failure', 0.38), ('mechanical', 0.38), ('when', 0.38), ('formed', 0.38), ('bolt', 0.38), ('mechanism', 0.37), ('warm', 0.37), ('areas', 0.36), ('area', 0.36), ('energy', 0.34), ('disorder', 0.33), ('barry', 0.33), ('shock', 0.32), ('determined', 0.32), ('gage', 0.32), ('sash', 0.31), ('theory', 0.31), ('level', 0.31), ('resistant', 0.31), ('brake', 0.3), ('window', 0.3), ('crash', 0.3), ('hazard', 0.29), ('##ink', 0.27), ('ceramic', 0.27), ('storm', 0.25), ('problem', 0.25), ('issue', 0.24), ('impact', 0.24), ('fridge', 0.24), ('injury', 0.23), ('ross', 0.22), ('causes', 0.22), ('affect', 0.21), ('pressure', 0.21), ('fatigue', 0.21), ('leak', 0.21), ('eye', 0.2), ('frank', 0.2), ('cool', 0.2), ('might', 0.19), ('gravity', 0.18), ('ray', 0.18), ('static', 0.18), ('collapse', 0.18), ('physics', 0.18), ('wave', 0.18), ('reflection', 0.17), ('parker', 0.17), ('strike', 0.17), ('hottest', 0.17), ('burst', 0.16), ('chance', 0.16), ('burn', 0.14), ('rubbing', 0.14), ('interference', 0.14), ('bailey', 0.13), ('vibration', 0.12), ('gilbert', 0.12), ('produced', 0.12), ('rock', 0.12), ('warmer', 0.11), ('get', 0.11), ('drink', 0.11), ('fireplace', 0.11), ('ruin', 0.1), ('brittle', 0.1), ('fragment', 0.1), ('stumble', 0.09), ('formation', 0.09), ('shatter', 0.08), ('great', 
0.08), ('friction', 0.08), ('flash', 0.07), ('cracks', 0.07), ('levels', 0.07), ('smash', 0.04), ('fail', 0.04), ('fra', 0.04), ('##glass', 0.03), ('variables', 0.03), ('because', 0.02), ('knock', 0.02), ('sun', 0.02), ('crush', 0.01), ('##e', 0.01), ('anger', 0.01)]

naver/splade-v2-distil (FLOPS = 3.82; our model uses about 48% fewer tokens)

number of actual dimensions:  234
SPLADE BOW rep:
 [('glass', 2.55), ('stress', 2.39), ('thermal', 2.38), ('glasses', 1.95), ('stressed', 1.87), ('crack', 1.84), ('cool', 1.78), ('heat', 1.62), ('pan', 1.6), ('break', 1.53), ('adjacent', 1.44), ('hotter', 1.43), ('strain', 1.21), ('area', 1.16), ('adjoining', 1.14), ('heated', 1.11), ('window', 1.07), ('stresses', 1.04), ('hot', 1.03), ('created', 1.03), ('create', 1.03), ('cause', 1.02), ('factors', 1.02), ('cooler', 1.01), ('broken', 1.0), ('too', 0.99), ('fracture', 0.96), ('collapse', 0.96), ('cracking', 0.95), ('great', 0.93), ('happen', 0.93), ('windows', 0.89), ('broke', 0.87), ('##e', 0.87), ('pressure', 0.84), ('hottest', 0.84), ('breaking', 0.83), ('govern', 0.79), ('shatter', 0.76), ('level', 0.75), ('heating', 0.69), ('temperature', 0.69), ('cracked', 0.69), ('panel', 0.68), ('##glass', 0.68), ('ceramic', 0.67), ('sash', 0.66), ('warm', 0.66), ('areas', 0.64), ('creating', 0.63), ('will', 0.62), ('tension', 0.61), ('cracks', 0.61), ('optical', 0.6), ('mechanism', 0.58), ('kelly', 0.58), ('determined', 0.58), ('generate', 0.58), ('causes', 0.56), ('if', 0.56), ('factor', 0.56), ('the', 0.56), ('chemical', 0.55), ('governed', 0.55), ('crystal', 0.55), ('strike', 0.55), ('microsoft', 0.54), ('creates', 0.53), ('than', 0.53), ('relation', 0.53), ('glazed', 0.52), ('compression', 0.51), ('painting', 0.51), ('governing', 0.5), ('harden', 0.49), ('solar', 0.48), ('reflection', 0.48), ('ic', 0.46), ('split', 0.45), ('mirror', 0.44), ('damage', 0.43), ('ring', 0.42), ('formation', 0.42), ('wall', 0.41), ('burst', 0.4), ('radiant', 0.4), ('determine', 0.4), ('one', 0.4), ('plastic', 0.39), ('furnace', 0.39), ('difference', 0.39), ('melt', 0.39), ('get', 0.39), ('contract', 0.38), ('forces', 0.38), ('gets', 0.38), ('produce', 0.38), ('surrounding', 0.37), ('vibration', 0.37), ('tile', 0.37), ('fail', 0.36), ('warmer', 0.36), ('rock', 0.35), ('fault', 0.35), ('roof', 0.34), ('burned', 0.34), ('physics', 0.33), ('welding', 0.33), ('why', 0.33), ('a', 0.32), ('pop', 
0.32), ('and', 0.31), ('fra', 0.3), ('stat', 0.3), ('withstand', 0.3), ('sunglasses', 0.3), ('material', 0.29), ('ice', 0.29), ('generated', 0.29), ('matter', 0.29), ('frame', 0.28), ('elements', 0.28), ('then', 0.28), ('.', 0.28), ('pont', 0.28), ('blow', 0.28), ('snap', 0.27), ('metal', 0.26), ('effect', 0.26), ('reaction', 0.26), ('related', 0.25), ('aluminium', 0.25), ('neighboring', 0.25), ('weight', 0.25), ('steel', 0.25), ('bulb', 0.25), ('tear', 0.25), ('coating', 0.25), ('plumbing', 0.25), ('co', 0.25), ('microwave', 0.24), ('formed', 0.24), ('pipe', 0.23), ('drink', 0.23), ('chemistry', 0.23), ('energy', 0.22), ('reflect', 0.22), ('dynamic', 0.22), ('leak', 0.22), ('is', 0.22), ('lens', 0.21), ('frost', 0.21), ('lenses', 0.21), ('produced', 0.21), ('induced', 0.2), ('arise', 0.2), ('plate', 0.2), ('equations', 0.19), ('affect', 0.19), ('tired', 0.19), ('mirrors', 0.18), ('thickness', 0.18), ('bending', 0.18), ('cabinet', 0.17), ('apart', 0.17), ('##thermal', 0.17), ('gas', 0.17), ('equation', 0.17), ('relationship', 0.17), ('composition', 0.17), ('engineering', 0.17), ('block', 0.16), ('breaks', 0.16), ('when', 0.16), ('definition', 0.16), ('collapsed', 0.16), ('generation', 0.16), (',', 0.16), ('philips', 0.16), ('later', 0.15), ('wood', 0.15), ('neighbouring', 0.15), ('structural', 0.14), ('regulate', 0.14), ('neighbors', 0.13), ('lighting', 0.13), ('happens', 0.13), ('more', 0.13), ('property', 0.13), ('cooling', 0.12), ('shattering', 0.12), ('melting', 0.12), ('how', 0.11), ('cloud', 0.11), ('barriers', 0.11), ('lam', 0.11), ('conditions', 0.11), ('rule', 0.1), ('insulation', 0.1), ('bathroom', 0.09), ('convection', 0.09), ('cavity', 0.09), ('source', 0.08), ('properties', 0.08), ('bend', 0.08), ('bottles', 0.08), ('ceramics', 0.07), ('temper', 0.07), ('tense', 0.07), ('keller', 0.07), ('breakdown', 0.07), ('concrete', 0.07), ('simon', 0.07), ('solids', 0.06), ('windshield', 0.05), ('eye', 0.05), ('sunlight', 0.05), ('brittle', 0.03), ('caused', 
0.03), ('suns', 0.03), ('floor', 0.02), ('components', 0.02), ('photo', 0.02), ('change', 0.02), ('sun', 0.01), ('crystals', 0.01), ('problem', 0.01), ('##proof', 0.01), ('parameters', 0.01), ('gases', 0.0), ('prism', 0.0), ('doing', 0.0), ('lattice', 0.0), ('ground', 0.0)]

Note 1: this particular passage was used as an example for ease of comparison.
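The 4% and 48% figures can be checked against the "number of actual dimensions" printed for each model above (121, 126, and 234). The short calculation below assumes the percentages are measured relative to the other model's token count, which is the reading that matches the printed numbers.

```python
def percent_fewer(ours, other):
    """How many percent fewer active dimensions `ours` has than `other`."""
    return 100.0 * (other - ours) / other

# Active dimensions reported above for the same example passage.
dims = {
    "ours": 121,
    "naver/splade-cocondenser-ensembledistil": 126,
    "naver/splade-v2-distil": 234,
}
for name, n in dims.items():
    if name != "ours":
        saving = percent_fewer(dims["ours"], n)
        print(f"{name}: ours uses {saving:.1f}% fewer tokens")
```

This prints roughly 4.0% for the cocondenser model and 48.3% for splade-v2-distil, for this one passage only.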

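4. How this translates into empirical metrics

Our model is token-sparse yet still effective, which translates to faster retrieval (user experience) and a smaller index (cost). Below are the mean retrieval time on the standard MS-MARCO small dev set and the corresponding scaled total FLOPS loss.

This is why Google's SparseEmbed is interesting: it also achieves SPLADE-level retrieval effectiveness with lower FLOPs. Compared with ColBERT, SPLADE and SparseEmbed match query and document terms with linear complexity, whereas ColBERT's late interaction, i.e. all query-document term pairs, has quadratic complexity. The challenge with SparseEmbed is that it uses a hyperparameter called Top-k to restrict the number of tokens used to learn contextual dense representations, for example 64 and 256 tokens for query and passage encoding respectively.

But it is unclear how well these hyperparameters transfer to other domains or languages (in some languages, such as our mother tongue Tamil, which is highly agglutinative, the notion of a token changes considerably).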
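The complexity contrast above can be illustrated with two toy scoring functions. This is a simplified sketch, not the scoring code of SPLADE, SparseEmbed, or ColBERT: the sparse score is a dot product over shared terms, while the late-interaction score compares every query token vector with every document token vector (MaxSim). All weights, vectors, and function names are invented.

```python
def sparse_score(query, doc):
    """Dot product of two {term: weight} maps.

    Iterates over the smaller map only, so the work grows linearly
    with the number of active terms.
    """
    small, large = (query, doc) if len(query) <= len(doc) else (doc, query)
    return sum(w * large[t] for t, w in small.items() if t in large)

def late_interaction_score(query_vecs, doc_vecs):
    """MaxSim-style score: every query vector is compared with every
    document vector, so the work grows with len(query) * len(doc)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Sparse representations: term -> weight.
query = {"glass": 1.5, "crack": 1.0}
doc = {"glass": 2.0, "thermal": 2.25, "crack": 0.5}
print(sparse_score(query, doc))  # 3.5

# Late interaction: one small dense vector per token.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.25]]
print(late_interaction_score(query_vecs, doc_vecs))  # 1.5
```

In practice the sparse score is served from an inverted index, which is why a Lucene-based engine can handle it directly.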

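Note: why Anserini and not PISA? Anserini is a production-ready library built on Lucene. Common industry search deployments use Solr or Elastic, which are also Lucene-based, so the performance is comparable. PISA latency is irrelevant for industry because it is only a research system. The full Anserini evaluation log, with encoding, indexing, and querying details, will be updated soon.

BEIR zero-shot (ZST) OOD performance: to be added at the end of the page.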

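How our model differs in other respects

Cocondenser weights: unlike the best official SPLADE++ or SparseEmbed, we do not initialize weights from the Luyu/co-condenser* models, yet we still achieve CoCondenser SPLADE-level performance. More on this later.
Same-size models: the official SPLADE++, SparseEmbed, and our model are all fine-tuned on base models of the same size, i.e. the size of bert-base-uncased.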

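5. Roadmap and future directions for industry suitability

Improve efficiency: this is a never-ending goal, and we will keep improving serving and retrieval efficiency.
Custom/domain fine-tuning: SPLADE models perform well in out-of-distribution (OOD) zero-shot settings, but that matters little in industry, where we need to be able to fine-tune on custom datasets or domains. Fine-tuning SPLADE on a new dataset is costly and requires labeled queries and passages. We will therefore keep exploring economical ways to fine-tune on custom datasets.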

💻 Usage examples

Basic usage

Use with popular vector databases

Vector database	Colab link
Pinecone
Qdrant	TBD

Using the SPLADERunner library

pip install spladerunner

# One-time initialization
from spladerunner import Expander
# The default model is the document expander.
expander = Expander()

# Sample document expansion
sparse_rep = expander.expand(
    ["The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science."])

Using HuggingFace

Note: if you are a notebook user, log in first

!huggingface-cli login

To use your HF token from code instead, make the following changes

tokenizer = AutoTokenizer.from_pretrained('prithivida/Splade_PP_en_v2', token="<your token>")
model = AutoModelForMaskedLM.from_pretrained('prithivida/Splade_PP_en_v2', token="<your token>")

Full code

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
# This page documents V2, so load the V2 checkpoint.
model_id = 'prithivida/Splade_PP_en_v2'
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Map vocabulary ids back to token strings for display.
reverse_voc = {v: k for k, v in tokenizer.get_vocab().items()}
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.to(device)
model.eval()

sentence = """The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science."""

inputs = tokenizer(sentence, return_tensors='pt')
inputs = {key: val.to(device) for key, val in inputs.items()}
attention_mask = inputs['attention_mask']

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

# SPLADE pooling: log(1 + ReLU(logits)), masked, then max over token positions.
logits = outputs.logits
relu_log = torch.log(1 + torch.relu(logits))
weighted_log = relu_log * attention_mask.unsqueeze(-1)
max_val, _ = torch.max(weighted_log, dim=1)
vector = max_val.squeeze()

# Keep only the non-zero vocabulary dimensions.
cols = vector.nonzero().squeeze().cpu().tolist()
print("number of actual dimensions: ", len(cols))
weights = vector[cols].cpu().tolist()

# Sort terms by weight, highest first, and map ids back to tokens.
d = {k: v for k, v in zip(cols, weights)}
sorted_d = {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
bow_rep = []
for k, v in sorted_d.items():
    bow_rep.append((reverse_voc[k], round(v, 2)))

print("SPLADE BOW rep:\n", bow_rep)
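To store the result in a sparse-capable vector database, the pooled vector usually has to be turned into parallel lists of non-zero positions and weights. The helper below is a minimal sketch of that conversion in plain Python. The name `to_sparse_pairs` and the `indices`/`values` keys are assumptions for illustration; check the exact field names your vector store expects.

```python
def to_sparse_pairs(vector, eps=0.0):
    """Return parallel lists of non-zero positions and their weights."""
    indices = [i for i, w in enumerate(vector) if w > eps]
    values = [float(vector[i]) for i in indices]
    return {"indices": indices, "values": values}

# Toy stand-in for `vector.cpu().tolist()` from the script above.
toy_vector = [0.0, 2.5, 0.0, 0.0, 0.75, 0.0]
sparse = to_sparse_pairs(toy_vector)
print(sparse)  # {'indices': [1, 4], 'values': [2.5, 0.75]}
```

Raising `eps` above zero drops very small weights, which shrinks the index further at the cost of discarding low-weight expansion terms.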

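Advanced usage

No advanced usage examples are available yet; they may be added later.

📚 Detailed documentation

BEIR zero-shot OOD performance

Training details

To be added

🔧 Technical details

This model is tuned for retrieval efficiency through FLOPS tuning and its choice of initial weights. Specifically, it uses separate sequence lengths and a severely restricted FLOPS schedule and token budget, 128 for documents and 24 for queries, and it initializes weights from a bert-base-uncased model mid-trained with an MLM loss. On a consumer-grade GPU with only 5 negatives per query, it reaches an MRR@10 of 37.8 on ID data with a retrieval latency of 48.81 ms.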

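📄 License

This project is licensed under Apache-2.0.

Acknowledgements

Thanks to Nils Reimers for all his input.
Thanks to the authors of the Anserini library.

Limitations and bias

All limitations and biases of the BERT model apply to this fine-tuning effort.

Citation

Please cite this work if you use our models or libraries. Citation info: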

Damodaran, P. (2024). Splade_PP_en_v2: Independent Implementation of SPLADE++ Model (`a.k.a splade-cocondenser* and family`) for the Industry setting. (Version 2.0.0) [Computer software].