opensearch-neural-sparse-encoding-doc-v2-distill open-source model: optimized for OpenSearch search, with more efficient inference-free retrieval

Opensearch Neural Sparse Encoding Doc V2 Distill

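Developed by opensearch-project

A distillation-based learned sparse retrieval model optimized for OpenSearch. Queries are encoded without model inference, and the model improves on the doc-v1 version in both search relevance and efficiency.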

Text Embedding

Transformers

English · License: Apache-2.0 #inference-free-retrieval #sparse-vector-encoding #document-search-optimization

Downloads: 1.8M

Released: 7/17/2024

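Model overview

The model encodes documents into 30522-dimensional sparse vectors and scores similarity as the inner product of the query and document sparse vectors, which makes it suitable for efficient retrieval.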

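Model features

Inference-free query encoding

Queries are turned into sparse vectors using only the tokenizer and a weight lookup table, so no model inference is needed at search time. Documents are still encoded by the model, at indexing time.

Distillation

Knowledge distillation reduces the model size (67M parameters versus 133M for v1) while maintaining relevance and lowering compute cost.

Efficient sparse retrieval

Uses sparse vector representations and the Lucene inverted index for efficient similarity computation.

Multi-dataset training

Trained on a mix of datasets, including MS MARCO and question-answer pair collections, to improve generalization.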

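Model capabilities

Document encoding into sparse vectors

Query sparse vector generation

Semantic similarity scoring

Efficient retrieval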

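Use cases

Search engines

OpenSearch neural search

Used with the OpenSearch neural sparse search feature to provide semantic document retrieval

Average zero-shot NDCG@10 of 0.504 on a 13-dataset subset of the BEIR benchmark

Question answering systems

Question-answer pair retrieval

Quickly retrieves answers relevant to a user's question from a knowledge base

NDCG@10 of 0.528 on the NQ dataset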

🚀 opensearch-neural-sparse-encoding-doc-v2-distill

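This is a learned sparse retrieval model that encodes documents into 30522-dimensional sparse vectors. It is intended for information retrieval and balances search relevance, model inference cost, and retrieval efficiency.

🚀 Quick Start

The model is normally run inside an OpenSearch cluster, but it can also be used outside the cluster through the HuggingFace model API. The following example shows how: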

import json
import itertools
import torch

from transformers import AutoModelForMaskedLM, AutoTokenizer


# get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
def get_sparse_vector(feature, output):
    values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
    values = torch.log(1 + torch.relu(values))
    values[:,special_token_ids] = 0
    return values
    
# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
    for i in range(len(end_idxs)-1):
        token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
        output.append(dict(zip(token_strings, weights)))
    return output
    
# download the idf file from model hub. idf is used to give weights for query tokens
def get_tokenizer_idf(tokenizer):
    from huggingface_hub import hf_hub_download
    local_cached_path = hf_hub_download(repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill", filename="idf.json")
    with open(local_cached_path) as f:
        idf = json.load(f)
    idf_vector = [0]*tokenizer.vocab_size
    for token,weight in idf.items():
        _id = tokenizer.convert_tokens_to_ids(token)
        idf_vector[_id]=weight
    return torch.tensor(idf_vector)

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill")
idf = get_tokenizer_idf(tokenizer)

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token



query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query
feature_query = tokenizer([query], padding=True, truncation=True, return_tensors='pt')
input_ids = feature_query["input_ids"]
batch_size = input_ids.shape[0]
query_vector = torch.zeros(batch_size, tokenizer.vocab_size)
query_vector[torch.arange(batch_size).unsqueeze(-1), input_ids] = 1
query_sparse_vector = query_vector*idf

# encode the document
feature_document = tokenizer([document], padding=True, truncation=True, return_tensors='pt')
output = model(**feature_document)[0]
document_sparse_vector = get_sparse_vector(feature_document, output)


# get similarity score
sim_score = torch.matmul(query_sparse_vector[0],document_sparse_vector[0])
print(sim_score)   # tensor(17.5307, grad_fn=<DotBackward0>)


query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
document_query_token_weight = transform_sparse_vector_to_dict(document_sparse_vector)[0]
for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
        

        
# result:
# score in query: 5.7729, score in document: 1.4109, token: ny
# score in query: 4.5684, score in document: 1.4673, token: weather
# score in query: 3.5895, score in document: 0.7473, token: now

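The code above shows an example of neural sparse search. Although the original query and document share no tokens, the model still produces a good match, because the document encoder assigns weight to related tokens such as ny, weather, and now.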
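Because the similarity score is an inner product of sparse vectors, it can be recomputed by hand from the token weights printed in the result. The sketch below does this in plain Python. It assumes the three printed tokens are the only ones with non-zero weight on both sides, and the helper name `sparse_dot` is illustrative rather than part of the model's API.

```python
# Token weights copied from the printed result (rounded to 4 decimals).
query_weights = {"ny": 5.7729, "weather": 4.5684, "now": 3.5895}
document_weights = {"ny": 1.4109, "weather": 1.4673, "now": 0.7473}


def sparse_dot(query, document):
    """Inner product of two sparse vectors stored as {token: weight} dicts."""
    # Only tokens present in both vectors contribute to the score.
    return sum(
        weight * document[token]
        for token, weight in query.items()
        if token in document
    )


score = sparse_dot(query_weights, document_weights)
print(f"{score:.2f}")  # 17.53
```

The value agrees with the `tensor(17.5307)` printed by the model code, up to the rounding of the printed weights.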

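✨ Key Features

Model selection

When selecting a model, consider search relevance, model inference cost, and retrieval efficiency (FLOPS) together. We benchmarked the models' zero-shot performance on a subset of the BEIR benchmark: TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.

Overall, the v2 series outperforms the v1 series in search relevance, retrieval efficiency, and inference speed. The specific advantages and disadvantages may vary across datasets.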

Model	Inference-free retrieval	Model parameters	Average NDCG@10	Average FLOPS
opensearch-neural-sparse-encoding-v1		133M	0.524	11.4
opensearch-neural-sparse-encoding-v2-distill		67M	0.528	8.3
opensearch-neural-sparse-encoding-doc-v1	✔️	133M	0.490	2.3
opensearch-neural-sparse-encoding-doc-v2-distill	✔️	67M	0.504	1.8
opensearch-neural-sparse-encoding-doc-v2-mini	✔️	23M	0.497	1.7

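Model overview

Paper: Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers
Fine-tuning sample: opensearch-sparse-model-tuning-sample

This is a learned sparse retrieval model. It encodes documents into 30522-dimensional sparse vectors. For queries, it uses only a tokenizer and a weight lookup table to generate sparse vectors. Each non-zero dimension index corresponds to a token in the vocabulary, and its weight indicates the importance of that token. The similarity score is the inner product of the query and document sparse vectors.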
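The query-side encoding can be sketched without any model call. The example below is a toy illustration only: the whitespace tokenizer, the function name `encode_query`, and the contents of `toy_idf` are assumptions made for the sketch. The real model uses its own tokenizer and the `idf.json` table; only the three weights shown are taken from the quick start output.

```python
# Hypothetical lookup table (token -> weight). The three values come from the
# quick start output; a real table covers the whole vocabulary.
toy_idf = {"ny": 5.7729, "weather": 4.5684, "now": 3.5895}


def encode_query(text, idf):
    """Build a sparse query vector as {token: weight} using only a lookup table."""
    # Toy tokenizer (assumption): lowercase the text and split on whitespace.
    tokens = set(text.lower().split())
    # Each distinct token gets its table weight once; unknown tokens are dropped.
    return {token: idf[token] for token in tokens if token in idf}


query_vector = encode_query("weather in ny now", toy_idf)
print(sorted(query_vector.items()))
```

Repeating a token in the query does not change its weight in this sketch, which mirrors the quick start code, where each token id present in the query is set to 1 before being multiplied by the idf vector.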

The training datasets include MS MARCO, eli5_question_answer, squad_pairs, WikiAnswers, yahoo_answers_title_question, gooaq_pairs, stackexchange_duplicate_questions_body_body, wikihow, S2ORC_title_abstract, stackexchange_duplicate_questions_title-body_title-body, yahoo_answers_question_answer, searchQA_top5_snippets, stackexchange_duplicate_questions_title_title, and yahoo_answers_title_answer.

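The OpenSearch neural sparse feature supports learned sparse retrieval with the Lucene inverted index (link: https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/). Indexing and search can be performed through the OpenSearch high-level API.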
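As a rough illustration of what indexing and searching sparse vectors might look like, the sketch below only builds request bodies as Python dictionaries and sends nothing to a cluster. The index name, the field names, and the choice to pass precomputed weights through `query_tokens` are assumptions made for this example. The `rank_features` field type and the `neural_sparse` query options should be checked against the linked documentation for your OpenSearch version.

```python
import json

# Hypothetical names chosen for this sketch.
INDEX_NAME = "my-sparse-index"
SPARSE_FIELD = "passage_embedding"

# Index mapping: token weights are stored in a `rank_features` field.
index_mapping = {
    "mappings": {
        "properties": {
            SPARSE_FIELD: {"type": "rank_features"},
            "passage_text": {"type": "text"},
        }
    }
}

# A document carrying a sparse vector computed outside the cluster.
# Only a subset of the document's token weights is shown here.
document_body = {
    "passage_text": "Currently New York is rainy.",
    SPARSE_FIELD: {"ny": 1.4109, "weather": 1.4673, "now": 0.7473},
}

# A neural_sparse query that supplies raw query token weights.
search_body = {
    "query": {
        "neural_sparse": {
            SPARSE_FIELD: {
                "query_tokens": {"ny": 5.7729, "weather": 4.5684, "now": 3.5895}
            }
        }
    }
}

print(json.dumps(search_body, indent=2))
```

In a typical deployment the cluster computes the document vectors itself during ingestion, so supplying weights by hand as above is mainly useful for experiments outside the cluster.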

Detailed search relevance

Model	Average	Trec Covid	NFCorpus	NQ	HotpotQA	FiQA	ArguAna	Touche	DBPedia	SCIDOCS	FEVER	Climate FEVER	SciFact	Quora
opensearch-neural-sparse-encoding-v1	0.524	0.771	0.360	0.553	0.697	0.376	0.508	0.278	0.447	0.164	0.821	0.263	0.723	0.856
opensearch-neural-sparse-encoding-v2-distill	0.528	0.775	0.347	0.561	0.685	0.374	0.551	0.278	0.435	0.173	0.849	0.249	0.722	0.863
opensearch-neural-sparse-encoding-doc-v1	0.490	0.707	0.352	0.521	0.677	0.344	0.461	0.294	0.412	0.154	0.743	0.202	0.716	0.788
opensearch-neural-sparse-encoding-doc-v2-distill	0.504	0.690	0.343	0.528	0.675	0.357	0.496	0.287	0.418	0.166	0.818	0.224	0.715	0.841
opensearch-neural-sparse-encoding-doc-v2-mini	0.497	0.709	0.336	0.510	0.666	0.338	0.480	0.285	0.407	0.164	0.812	0.216	0.699	0.837
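The Average column can be cross-checked against the per-dataset columns. The short script below recomputes the unweighted mean NDCG@10 for each model from the 13 values in its row; the variable names are illustrative, and the numbers are copied from the table.

```python
# Per-dataset NDCG@10 values copied from the table, in column order
# (Trec Covid ... Quora).
scores = {
    "opensearch-neural-sparse-encoding-v1":
        [0.771, 0.360, 0.553, 0.697, 0.376, 0.508, 0.278, 0.447, 0.164, 0.821, 0.263, 0.723, 0.856],
    "opensearch-neural-sparse-encoding-v2-distill":
        [0.775, 0.347, 0.561, 0.685, 0.374, 0.551, 0.278, 0.435, 0.173, 0.849, 0.249, 0.722, 0.863],
    "opensearch-neural-sparse-encoding-doc-v1":
        [0.707, 0.352, 0.521, 0.677, 0.344, 0.461, 0.294, 0.412, 0.154, 0.743, 0.202, 0.716, 0.788],
    "opensearch-neural-sparse-encoding-doc-v2-distill":
        [0.690, 0.343, 0.528, 0.675, 0.357, 0.496, 0.287, 0.418, 0.166, 0.818, 0.224, 0.715, 0.841],
    "opensearch-neural-sparse-encoding-doc-v2-mini":
        [0.709, 0.336, 0.510, 0.666, 0.338, 0.480, 0.285, 0.407, 0.164, 0.812, 0.216, 0.699, 0.837],
}

# Unweighted mean over the 13 datasets, rounded to three decimals.
computed_average = {
    name: round(sum(values) / len(values), 3) for name, values in scores.items()
}

for name, average in computed_average.items():
    print(f"{name}: {average:.3f}")
```

The recomputed means match the Average column for all five models.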