OpenSearch Neural Sparse Encoding Model v1 (Open Source): Efficient Search Relevance and Document Retrieval

OpenSearch Neural Sparse Encoding V1

Developed by opensearch-project

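OpenSearch Neural Sparse Encoding Model v1 encodes queries and documents into 30522-dimensional sparse vectors for efficient, relevant search and retrieval.

Text Embedding

Transformers

English | Open Source License: Apache-2.0 | #SparseVectorRetrieval #ZeroShotSearch #EfficientSemanticMatching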

Downloads 10.20k

Release Time : 3/7/2024

Model Overview

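This is a learned sparse retrieval model that encodes queries and documents into 30522-dimensional sparse vectors and performs well in both search relevance and retrieval efficiency. The model is trained on the MS MARCO dataset and supports learned sparse retrieval with a Lucene inverted index.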

Model Features

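Efficient sparse encoding

Encodes queries and documents into 30522-dimensional sparse vectors. Each non-zero dimension index corresponds to a token in the vocabulary, and its weight indicates that token's importance.

Strong relevance

Performs well across multiple datasets in the BEIR benchmark, with an average NDCG@10 of 0.524.

OpenSearch integration

Designed for OpenSearch clusters, with efficient retrieval through a Lucene inverted index.

Zero-shot performance

Performs well on unseen datasets and can be used without fine-tuning.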
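
To make the sparse-vector and inverted-index ideas above concrete, here is a minimal sketch in plain Python. It is an illustration only and does not reflect how Lucene or OpenSearch implement their index internally: the helpers `build_inverted_index` and `search`, the document ids, and all token weights are hypothetical values chosen for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to a list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, token_weights in docs.items():
        for token, weight in token_weights.items():
            index[token].append((doc_id, weight))
    return index


def search(index, query_weights):
    """Score documents by the dot product of query and document weights."""
    scores = defaultdict(float)
    for token, q_weight in query_weights.items():
        # Only documents that contain the token are visited at all.
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sparse vectors, written as token -> weight dictionaries.
docs = {
    "doc_weather": {"york": 2.0, "weather": 1.5, "rainy": 1.0},
    "doc_sports": {"york": 1.0, "baseball": 2.0},
}
index = build_inverted_index(docs)

query = {"ny": 2.0, "york": 2.0, "weather": 2.5}
results = search(index, query)
print(results)
```

Because only non-zero dimensions are stored, the cost of a query depends on the length of the postings lists it touches rather than on the full 30522-dimensional vocabulary.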

Model Capabilities

Sparse text encoding

Information retrieval

Query-document matching

Zero-shot transfer learning

Use Cases

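Search engines

Document retrieval

Efficiently retrieve relevant documents from large document collections

Average NDCG@10 of 0.524 on the BEIR benchmark

Question answering systems

Match user questions with candidate answers

NDCG@10 of 0.553 on the NQ dataset

Domain-specific search

Scientific literature retrieval

Retrieve relevant papers from scientific literature databases

NDCG@10 of 0.723 on the SciFact dataset

Medical information retrieval

Retrieve medical documents and information

NDCG@10 of 0.771 on the TrecCovid dataset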
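
The scores above are all reported as NDCG@10. As a reminder of what that metric measures, the following is a minimal sketch of one common formulation (linear gain, log2 rank discount). It is an assumption that this matches the exact variant used for the reported numbers, and the sketch simplifies by computing the ideal ranking from the retrieved list only, whereas a full evaluation uses all judged documents for the query. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Relevance labels of the retrieved documents, in ranked order (made-up values).
perfect = ndcg_at_k([3, 2, 1, 0])
swapped = ndcg_at_k([0, 1])
print(perfect, swapped)
```

A ranking that places the most relevant documents first scores 1.0, and every relevant document pushed further down the list lowers the score.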

🚀 opensearch-neural-sparse-encoding-v1

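This project is a learned sparse retrieval model that encodes queries and documents into 30522-dimensional sparse vectors and performs well in both search relevance and retrieval efficiency.

🚀 Quick Start

This model is intended to run inside an OpenSearch cluster, but it can also be used outside a cluster through the HuggingFace model API. A usage example follows: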

import itertools
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


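# pool the MLM logits (batch_size * seq_len * vocab_size) into one sparse vector per input
def get_sparse_vector(feature, output):
    # mask out padding positions, then take the max logit per vocabulary entry over the sequence
    values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
    # log(1 + relu(x)) keeps the weights non-negative and dampens large activations
    values = torch.log(1 + torch.relu(values))
    # zero out special tokens; the ids are attached to this function after the tokenizer is loaded
    values[:,get_sparse_vector.special_token_ids] = 0
    return values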
    
# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
    for i in range(len(end_idxs)-1):
        token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
        output.append(dict(zip(token_strings, weights)))
    return output
    

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v1")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v1")

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token



query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query & document
feature = tokenizer([query, document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
output = model(**feature)[0]
sparse_vector = get_sparse_vector(feature, output)

# get similarity score
sim_score = torch.matmul(sparse_vector[0],sparse_vector[1])
print(sim_score)   # tensor(22.3299, grad_fn=<DotBackward0>)


query_token_weight, document_query_token_weight = transform_sparse_vector_to_dict(sparse_vector)
for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
        

        
# result:
# score in query: 2.9262, score in document: 2.1335, token: ny
# score in query: 2.5206, score in document: 1.5277, token: weather
# score in query: 2.0373, score in document: 2.3489, token: york
# score in query: 1.5786, score in document: 0.8752, token: cool
# score in query: 1.4636, score in document: 1.5132, token: current
# score in query: 0.7761, score in document: 0.8860, token: season
# score in query: 0.7560, score in document: 0.6726, token: 2020
# score in query: 0.7222, score in document: 0.6292, token: summer
# score in query: 0.6888, score in document: 0.6419, token: nina
# score in query: 0.6451, score in document: 0.8200, token: storm
# score in query: 0.4698, score in document: 0.7635, token: brooklyn
# score in query: 0.4562, score in document: 0.1208, token: julian
# score in query: 0.3484, score in document: 0.3903, token: wow
# score in query: 0.3439, score in document: 0.4160, token: usa
# score in query: 0.2751, score in document: 0.8260, token: manhattan
# score in query: 0.2013, score in document: 0.7735, token: fog
# score in query: 0.1989, score in document: 0.2961, token: mood
# score in query: 0.1653, score in document: 0.3437, token: climate
# score in query: 0.1191, score in document: 0.1533, token: nature
# score in query: 0.0665, score in document: 0.0600, token: temperature
# score in query: 0.0552, score in document: 0.3396, token: windy

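The code above shows an example of neural sparse search. Although the original query and document share no overlapping tokens, the model still produces a good match.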
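
The similarity score is a dot product, so only tokens with non-zero weight on both sides contribute to it. The sketch below recomputes the score from the shared-token weights printed in the result listing. The helper name `sparse_dot` is hypothetical, and the weights are copied from that listing, so the result agrees with the printed `tensor(22.3299)` only up to rounding.

```python
# Weights of the tokens that appear in both expansions, copied from the listing.
query_weights = {
    "ny": 2.9262, "weather": 2.5206, "york": 2.0373, "cool": 1.5786,
    "current": 1.4636, "season": 0.7761, "2020": 0.7560, "summer": 0.7222,
    "nina": 0.6888, "storm": 0.6451, "brooklyn": 0.4698, "julian": 0.4562,
    "wow": 0.3484, "usa": 0.3439, "manhattan": 0.2751, "fog": 0.2013,
    "mood": 0.1989, "climate": 0.1653, "nature": 0.1191,
    "temperature": 0.0665, "windy": 0.0552,
}
document_weights = {
    "ny": 2.1335, "weather": 1.5277, "york": 2.3489, "cool": 0.8752,
    "current": 1.5132, "season": 0.8860, "2020": 0.6726, "summer": 0.6292,
    "nina": 0.6419, "storm": 0.8200, "brooklyn": 0.7635, "julian": 0.1208,
    "wow": 0.3903, "usa": 0.4160, "manhattan": 0.8260, "fog": 0.7735,
    "mood": 0.2961, "climate": 0.3437, "nature": 0.1533,
    "temperature": 0.0600, "windy": 0.3396,
}


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as token -> weight dicts."""
    return sum(weight * b[token] for token, weight in a.items() if token in b)


score = sparse_dot(query_weights, document_weights)
print(round(score, 2))  # about 22.33
```

Most of the score comes from expansion tokens such as "ny", "york", and "weather", none of which appear in both the raw query and the raw document.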

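✨ Key Features

Multi-dataset evaluation: the model's zero-shot performance was benchmarked on a subset of the BEIR benchmark, including TrecCovid, NFCorpus, NQ, and other datasets.
Performance: overall, the v2 series of models outperforms the v1 series in search relevance, efficiency, and inference speed, though the specific strengths and weaknesses may vary across datasets.
Sparse vector encoding: encodes queries and documents into 30522-dimensional sparse vectors. Each non-zero dimension index corresponds to a token in the vocabulary, and its weight indicates that token's importance.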

📚 Detailed Documentation

Selecting a Model

When selecting a model, consider search relevance, model inference cost, and retrieval efficiency (FLOPS). The table below compares the available models:

Model	Inference-free retrieval	Model parameters	Average NDCG@10	Average FLOPS
opensearch-neural-sparse-encoding-v1		133M	0.524	11.4
opensearch-neural-sparse-encoding-v2-distill		67M	0.528	8.3
opensearch-neural-sparse-encoding-doc-v1	✔️	133M	0.490	2.3
opensearch-neural-sparse-encoding-doc-v2-distill	✔️	67M	0.504	1.8
opensearch-neural-sparse-encoding-doc-v2-mini	✔️	23M	0.497	1.7
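
One way to apply this comparison is to filter by hard constraints first and then take the most relevant remaining model. The sketch below encodes the table's rows and does exactly that. The `ModelInfo` class, the `pick_model` helper, and the selection rule itself are illustrative assumptions rather than an official recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    name: str
    inference_free: bool
    params_millions: int
    avg_ndcg10: float
    avg_flops: float


# Rows copied from the comparison table.
MODELS = [
    ModelInfo("opensearch-neural-sparse-encoding-v1", False, 133, 0.524, 11.4),
    ModelInfo("opensearch-neural-sparse-encoding-v2-distill", False, 67, 0.528, 8.3),
    ModelInfo("opensearch-neural-sparse-encoding-doc-v1", True, 133, 0.490, 2.3),
    ModelInfo("opensearch-neural-sparse-encoding-doc-v2-distill", True, 67, 0.504, 1.8),
    ModelInfo("opensearch-neural-sparse-encoding-doc-v2-mini", True, 23, 0.497, 1.7),
]


def pick_model(models, require_inference_free=False, max_flops=None):
    """Return the highest-NDCG@10 model that satisfies the constraints, or None."""
    candidates = [
        m for m in models
        if (m.inference_free or not require_inference_free)
        and (max_flops is None or m.avg_flops <= max_flops)
    ]
    return max(candidates, key=lambda m: m.avg_ndcg10, default=None)


best_overall = pick_model(MODELS)
best_inference_free = pick_model(MODELS, require_inference_free=True)
print(best_overall.name)
print(best_inference_free.name)
```

In practice the right trade-off also depends on the dataset, since per-dataset relevance differs from the averages.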

Detailed Search Relevance

Model	Average	Trec Covid	NFCorpus	NQ	HotpotQA	FiQA	ArguAna	Touche	DBPedia	SCIDOCS	FEVER	Climate FEVER	SciFact	Quora
opensearch-neural-sparse-encoding-v1	0.524	0.771	0.360	0.553	0.697	0.376	0.508	0.278	0.447	0.164	0.821	0.263	0.723	0.856
opensearch-neural-sparse-encoding-v2-distill	0.528	0.775	0.347	0.561	0.685	0.374	0.551	0.278	0.435	0.173	0.849	0.249	0.722	0.863
opensearch-neural-sparse-encoding-doc-v1	0.490	0.707	0.352	0.521	0.677	0.344	0.461	0.294	0.412	0.154	0.743	0.202	0.716	0.788
opensearch-neural-sparse-encoding-doc-v2-distill	0.504	0.690	0.343	0.528	0.675	0.357	0.496	0.287	0.418	0.166	0.818	0.224	0.715	0.841
opensearch-neural-sparse-encoding-doc-v2-mini	0.497	0.709	0.336	0.510	0.666	0.338	0.480	0.285	0.407	0.164	0.812	0.216	0.699	0.837
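
The "Average" column is the unweighted mean of the 13 per-dataset NDCG@10 scores. A short check for the first two rows, with the numbers copied from the table, illustrates the arithmetic:

```python
# Per-dataset NDCG@10 scores, in the column order of the table.
scores = {
    "opensearch-neural-sparse-encoding-v1": [
        0.771, 0.360, 0.553, 0.697, 0.376, 0.508, 0.278,
        0.447, 0.164, 0.821, 0.263, 0.723, 0.856,
    ],
    "opensearch-neural-sparse-encoding-v2-distill": [
        0.775, 0.347, 0.561, 0.685, 0.374, 0.551, 0.278,
        0.435, 0.173, 0.849, 0.249, 0.722, 0.863,
    ],
}

# Unweighted mean over the datasets, rounded to three decimals like the table.
averages = {name: round(sum(vals) / len(vals), 3) for name, vals in scores.items()}
for name, avg in averages.items():
    print(name, avg)
```

Both results match the reported averages of 0.524 and 0.528.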

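Model Overview

Paper: Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers
Fine-tuning sample: opensearch-sparse-model-tuning-sample

This model is a learned sparse retrieval model trained on the MS MARCO dataset. The OpenSearch neural sparse feature supports learned sparse retrieval with a Lucene inverted index; see https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/ . Indexing and search can be performed through the OpenSearch high-level API.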
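
For orientation, the sketch below builds a search request body that uses the `neural_sparse` query type described at the link above. The field name `passage_embedding`, the model id placeholder, and the helper `build_neural_sparse_query` are assumptions made for illustration; the sketch only constructs the JSON body and does not contact a cluster. Consult the linked documentation for the parameters supported by your OpenSearch version.

```python
import json


def build_neural_sparse_query(field, query_text, model_id, size=10):
    """Build a search body that runs a neural_sparse query on one field."""
    return {
        "size": size,
        "query": {
            "neural_sparse": {
                field: {
                    "query_text": query_text,
                    "model_id": model_id,
                }
            }
        },
    }


# Placeholder values: replace them with your own field name and deployed model id.
body = build_neural_sparse_query(
    field="passage_embedding",
    query_text="What's the weather in ny now?",
    model_id="<your-model-id>",
)
print(json.dumps(body, indent=2))
```

The resulting body can be sent to an index's `_search` endpoint with any HTTP or OpenSearch client.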