OpenSearch Neural Sparse Encoding v2 Distill (open source): efficient sparse retrieval for queries and documents

Opensearch Neural Sparse Encoding V2 Distill

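Developed by opensearch-project

OpenSearch Neural Sparse Encoding v2 Distill is an efficient learned sparse retrieval model designed for OpenSearch. It encodes queries and documents into 30522-dimensional sparse vectors.

Text Embedding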

Transformers

Language: English  License: Apache-2.0  #Learned sparse retrieval #Zero-shot performance #Efficient inference

Downloads: 4,964

Release date: 7/17/2024

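Model Introduction

The model is intended for retrieval tasks. It converts queries and documents into sparse vectors and supports sparse retrieval backed by the Lucene inverted index, which suits a range of information retrieval scenarios.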

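Model Features

Efficient sparse retrieval

Supports sparse retrieval over the Lucene inverted index, which improves retrieval efficiency.

Distilled variant

Compared with the v1 base model, the parameter count is roughly halved (133M to 67M in the comparison table) while performance is maintained or improved.

Multi-dataset training

The training data covers 14 public datasets, including MS MARCO, eli5_question_answer, and squad_pairs.

Semantic matching

Even when the original texts share no tokens, the model can still match them effectively through semantic association.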

Model Capabilities

Text retrieval

Query expansion

Document expansion

Semantic matching

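Use Cases

Information retrieval

Document retrieval

Quickly retrieve relevant documents from a large document collection.

Average NDCG@10 of 0.528 on a subset of the BEIR benchmark

Question answering

Retrieve relevant passages for question answering systems.

NDCG@10 of 0.561 on the NQ (Natural Questions) dataset

Search engines

OpenSearch integration

Serves as the core component of the OpenSearch neural sparse search feature.

Supports efficient retrieval over the Lucene inverted index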

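🚀 OpenSearch Neural Sparse Encoding v2 Distill

This is a learned sparse retrieval model that encodes queries and documents into 30522-dimensional sparse vectors and performs well on search relevance, efficiency, and inference speed.

🚀 Quick Start

Select the model

When choosing a model, weigh search relevance, model inference cost, and retrieval efficiency (FLOPS) together. We benchmarked the models' zero-shot performance on a subset of the BEIR benchmark: TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.

Overall, the v2 series achieves better search relevance, efficiency, and inference speed than the v1 series. The specific advantages and disadvantages may vary across datasets.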

Model	Inference-free for retrieval	Model parameters	Average NDCG@10	Average FLOPS
opensearch-neural-sparse-encoding-v1		133M	0.524	11.4
opensearch-neural-sparse-encoding-v2-distill		67M	0.528	8.3
opensearch-neural-sparse-encoding-doc-v1	✔️	133M	0.490	2.3
opensearch-neural-sparse-encoding-doc-v2-distill	✔️	67M	0.504	1.8
opensearch-neural-sparse-encoding-doc-v2-mini	✔️	23M	0.497	1.7
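
To make the trade-off in the table concrete, the following minimal sketch computes the relative change between a v1 model and its v2 distilled counterpart. The figures are copied from the table; the `models` dictionary and the `compare` helper are hypothetical names introduced only for this illustration.

```python
# Figures copied from the model-selection table (parameters in millions).
models = {
    "opensearch-neural-sparse-encoding-v1": {"params_m": 133, "ndcg": 0.524, "flops": 11.4},
    "opensearch-neural-sparse-encoding-v2-distill": {"params_m": 67, "ndcg": 0.528, "flops": 8.3},
    "opensearch-neural-sparse-encoding-doc-v1": {"params_m": 133, "ndcg": 0.490, "flops": 2.3},
    "opensearch-neural-sparse-encoding-doc-v2-distill": {"params_m": 67, "ndcg": 0.504, "flops": 1.8},
}


def compare(old_name, new_name):
    """Return the NDCG@10 delta and the percentage reduction in FLOPS and parameters."""
    old, new = models[old_name], models[new_name]
    return {
        "ndcg_delta": round(new["ndcg"] - old["ndcg"], 3),
        "flops_reduction_pct": round(100 * (1 - new["flops"] / old["flops"]), 1),
        "params_reduction_pct": round(100 * (1 - new["params_m"] / old["params_m"]), 1),
    }


bi_encoder = compare(
    "opensearch-neural-sparse-encoding-v1",
    "opensearch-neural-sparse-encoding-v2-distill",
)
doc_only = compare(
    "opensearch-neural-sparse-encoding-doc-v1",
    "opensearch-neural-sparse-encoding-doc-v2-distill",
)
print(bi_encoder)
print(doc_only)
```

On these numbers, the v2 distilled models cut average FLOPS by roughly a quarter and parameters by about half while the average NDCG@10 rises slightly.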

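Overview

Paper: Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers
Fine-tuning sample: opensearch-sparse-model-tuning-sample

This is a learned sparse retrieval model. It encodes queries and documents into 30522-dimensional sparse vectors. Each non-zero dimension index corresponds to a token in the vocabulary, and the weight expresses the importance of that token.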
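
As a small illustration of this representation, the sketch below maps a dense weight vector to a token-to-weight dictionary and scores a query against a document. The six-entry `vocab` and all weights are made up for the example (the real vocabulary has 30522 entries), and the helper names are hypothetical.

```python
# Toy vocabulary standing in for the model's 30522-entry vocabulary (hypothetical).
vocab = ["[PAD]", "weather", "york", "ny", "rain", "stock"]


def dense_to_token_weights(vector, vocab):
    """Keep only non-zero dimensions and label each with its vocabulary token."""
    return {vocab[i]: w for i, w in enumerate(vector) if w != 0}


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    return sum(w * b[token] for token, w in a.items() if token in b)


# Made-up weights, not model output.
query_vec = [0.0, 2.0, 1.0, 1.5, 0.0, 0.0]
doc_vec = [0.0, 0.5, 3.0, 0.0, 2.0, 0.0]

query_weights = dense_to_token_weights(query_vec, vocab)
doc_weights = dense_to_token_weights(doc_vec, vocab)

score = sparse_dot(query_weights, doc_weights)
dense_score = sum(q * d for q, d in zip(query_vec, doc_vec))
print(query_weights, doc_weights, score, dense_score)
```

Only tokens that are non-zero on both sides contribute to the score, so the sparse and dense dot products agree.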

The training datasets are MS MARCO, eli5_question_answer, squad_pairs, WikiAnswers, yahoo_answers_title_question, gooaq_pairs, stackexchange_duplicate_questions_body_body, wikihow, S2ORC_title_abstract, stackexchange_duplicate_questions_title-body_title-body, yahoo_answers_question_answer, searchQA_top5_snippets, stackexchange_duplicate_questions_title_title, and yahoo_answers_title_answer.

The OpenSearch neural sparse feature supports learned sparse retrieval with the Lucene inverted index. Link: https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/. Indexing and search can be performed with the OpenSearch high-level API.
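
To give an intuition for why sparse vectors fit an inverted index, here is a pure-Python toy. It is not the Lucene or OpenSearch implementation and uses no OpenSearch API; the function names, document ids, and weights are all hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, weights in docs.items():
        for token, weight in weights.items():
            index[token].append((doc_id, weight))
    return index


def search(index, query_weights, top_k=10):
    """Accumulate query_weight * doc_weight over the posting lists of the query tokens."""
    scores = defaultdict(float)
    for token, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Made-up document vectors, not model output.
docs = {
    "d1": {"york": 3.0, "rain": 2.0, "weather": 0.5},
    "d2": {"london": 2.5, "rain": 1.0},
    "d3": {"stock": 2.0, "market": 1.5},
}
index = build_inverted_index(docs)
results = search(index, {"weather": 2.0, "york": 1.0, "rain": 0.5})
print(results)
```

Documents that share no token with the query, such as `d3` here, are never touched, which is the property that keeps scoring cheap when vectors are sparse.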

💻 Usage Examples

Basic usage

import itertools
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


# get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
def get_sparse_vector(feature, output):
    values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
    values = torch.log(1 + torch.relu(values))
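    # zero out special tokens; the ids are attached to this function after the tokenizer is loaded
    values[:, get_sparse_vector.special_token_ids] = 0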
    return values
    
# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
    for i in range(len(end_idxs)-1):
        token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
        output.append(dict(zip(token_strings, weights)))
    return output
    

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v2-distill")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v2-distill")

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token



query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query & document
feature = tokenizer([query, document], padding=True, truncation=True, return_tensors='pt')
output = model(**feature)[0]
sparse_vector = get_sparse_vector(feature, output)

# get similarity score
sim_score = torch.matmul(sparse_vector[0],sparse_vector[1])
print(sim_score)   # tensor(38.6112, grad_fn=<DotBackward0>)


query_token_weight, document_query_token_weight = transform_sparse_vector_to_dict(sparse_vector)
for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
        

        
# result:
# score in query: 2.7273, score in document: 2.9088, token: york
# score in query: 2.5734, score in document: 0.9208, token: now
# score in query: 2.3895, score in document: 1.7237, token: ny
# score in query: 2.2184, score in document: 1.2368, token: weather
# score in query: 1.8693, score in document: 1.4146, token: current
# score in query: 1.5887, score in document: 0.7450, token: today
# score in query: 1.4704, score in document: 0.9247, token: sunny
# score in query: 1.4374, score in document: 1.9737, token: nyc
# score in query: 1.4347, score in document: 1.6019, token: currently
# score in query: 1.1605, score in document: 0.9794, token: climate
# score in query: 1.0944, score in document: 0.7141, token: upstate
# score in query: 1.0471, score in document: 0.5519, token: forecast
# score in query: 0.9268, score in document: 0.6692, token: verve
# score in query: 0.9126, score in document: 0.4486, token: huh
# score in query: 0.8960, score in document: 0.7706, token: greene
# score in query: 0.8779, score in document: 0.7120, token: picturesque
# score in query: 0.8471, score in document: 0.4183, token: pleasantly
# score in query: 0.8079, score in document: 0.2140, token: windy
# score in query: 0.7537, score in document: 0.4925, token: favorable
# score in query: 0.7519, score in document: 2.1456, token: rain
# score in query: 0.7277, score in document: 0.3818, token: skies
# score in query: 0.6995, score in document: 0.8593, token: lena
# score in query: 0.6895, score in document: 0.2410, token: sunshine
# score in query: 0.6621, score in document: 0.3016, token: johnny
# score in query: 0.6604, score in document: 0.1933, token: skyline
# score in query: 0.6117, score in document: 0.2197, token: sasha
# score in query: 0.5962, score in document: 0.0414, token: vibe
# score in query: 0.5381, score in document: 0.7560, token: hardly
# score in query: 0.4582, score in document: 0.4243, token: prevailing
# score in query: 0.4539, score in document: 0.5073, token: unpredictable
# score in query: 0.4350, score in document: 0.8463, token: presently
# score in query: 0.3674, score in document: 0.2496, token: hail
# score in query: 0.3324, score in document: 0.5506, token: shivered
# score in query: 0.3281, score in document: 0.1964, token: wind
# score in query: 0.3052, score in document: 0.5785, token: rudy
# score in query: 0.2797, score in document: 0.0357, token: looming
# score in query: 0.2712, score in document: 0.0870, token: atmospheric
# score in query: 0.2471, score in document: 0.3490, token: vicky
# score in query: 0.2247, score in document: 0.2383, token: sandy
# score in query: 0.2154, score in document: 0.5737, token: crowded
# score in query: 0.1723, score in document: 0.1857, token: chilly
# score in query: 0.1700, score in document: 0.4110, token: blizzard
# score in query: 0.1183, score in document: 0.0613, token: ##cken
# score in query: 0.0923, score in document: 0.6363, token: unrest
# score in query: 0.0624, score in document: 0.2127, token: russ
# score in query: 0.0558, score in document: 0.5542, token: blackout
# score in query: 0.0549, score in document: 0.1589, token: kahn
# score in query: 0.0160, score in document: 0.0566, token: 2020
# score in query: 0.0125, score in document: 0.3753, token: nighttime

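The code sample above shows an example of neural sparse search. Although the original query and document share no tokens, the model still produces a good match.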
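
Read directly from `get_sparse_vector` and the `torch.matmul` call in the sample, the token weights and the similarity score can be written as follows. The symbols are notation introduced here for illustration: $z_{ij}$ is the masked-LM logit at sequence position $i$ for vocabulary entry $j$, $m_i \in \{0, 1\}$ is the attention mask, $L$ is the padded sequence length, $\mathcal{S}$ is the set of special-token ids, and $V = 30522$.

```latex
\[
w_j =
\begin{cases}
0, & j \in \mathcal{S}, \\
\log\Bigl(1 + \mathrm{ReLU}\bigl(\max_{1 \le i \le L} m_i \, z_{ij}\bigr)\Bigr), & \text{otherwise},
\end{cases}
\qquad j = 1, \dots, V
\]
\[
\mathrm{score}(q, d) = \sum_{j=1}^{V} w_j^{(q)} \, w_j^{(d)}
\]
```

Because $\log(1 + \mathrm{ReLU}(\cdot))$ is non-negative, every weight is non-negative, and only tokens activated in both the query and the document add to the score.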

📚 Detailed Documentation

Detailed search relevance

Model	Average	Trec Covid	NFCorpus	NQ	HotpotQA	FiQA	ArguAna	Touche	DBPedia	SCIDOCS	FEVER	Climate FEVER	SciFact	Quora
opensearch-neural-sparse-encoding-v1	0.524	0.771	0.360	0.553	0.697	0.376	0.508	0.278	0.447	0.164	0.821	0.263	0.723	0.856
opensearch-neural-sparse-encoding-v2-distill	0.528	0.775	0.347	0.561	0.685	0.374	0.551	0.278	0.435	0.173	0.849	0.249	0.722	0.863
opensearch-neural-sparse-encoding-doc-v1	0.490	0.707	0.352	0.521	0.677	0.344	0.461	0.294	0.412	0.154	0.743	0.202	0.716	0.788
opensearch-neural-sparse-encoding-doc-v2-distill	0.504	0.690	0.343	0.528	0.675	0.357	0.496	0.287	0.418	0.166	0.818	0.224	0.715	0.841
opensearch-neural-sparse-encoding-doc-v2-mini	0.497	0.709	0.336	0.510	0.666	0.338	0.480	0.285	0.407	0.164	0.812	0.216	0.699	0.837
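
The per-dataset figures can be checked against the reported averages and compared model by model. The sketch below does this for v1 and v2-distill using the NDCG@10 values from the table; the variable names are hypothetical.

```python
datasets = ["Trec Covid", "NFCorpus", "NQ", "HotpotQA", "FiQA", "ArguAna", "Touche",
            "DBPedia", "SCIDOCS", "FEVER", "Climate FEVER", "SciFact", "Quora"]

# NDCG@10 per dataset, copied from the table rows for the two bi-encoder models.
v1 = [0.771, 0.360, 0.553, 0.697, 0.376, 0.508, 0.278,
      0.447, 0.164, 0.821, 0.263, 0.723, 0.856]
v2_distill = [0.775, 0.347, 0.561, 0.685, 0.374, 0.551, 0.278,
              0.435, 0.173, 0.849, 0.249, 0.722, 0.863]


def mean(values):
    return sum(values) / len(values)


# Split the datasets by which model scores higher.
better = [name for name, a, b in zip(datasets, v1, v2_distill) if b > a]
worse = [name for name, a, b in zip(datasets, v1, v2_distill) if b < a]
tied = [name for name, a, b in zip(datasets, v1, v2_distill) if b == a]

print("v1 average:", round(mean(v1), 3))
print("v2-distill average:", round(mean(v2_distill), 3))
print("v2-distill better on:", better)
print("v2-distill worse on:", worse)
print("tied on:", tied)
```

The recomputed means match the Average column, and the win/loss split shows that the higher average of v2-distill does not hold on every individual dataset.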