opensearch-neural-sparse-encoding-doc-v2-distill
This is a learned sparse retrieval model that encodes documents into 30522-dimensional sparse vectors, enabling efficient retrieval with OpenSearch.
🚀 Quick Start
This model is designed for learned sparse retrieval. It encodes documents into 30522-dimensional sparse vectors. For queries, it uses only a tokenizer and a weight look-up table to generate sparse vectors, so no model inference is needed at query time. The similarity score is the inner product of the query and document sparse vectors.
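To make the scoring step concrete, here is a minimal sketch of the inner-product computation, assuming the query and document have already been encoded into token-to-weight dictionaries (the dictionaries and weights below are illustrative, not real model output):

# Sparse vectors represented as token -> weight dictionaries (illustrative values).
query_weights = {"weather": 2.1, "ny": 1.8, "now": 0.9}
doc_weights = {"weather": 1.5, "ny": 1.2, "rain": 1.1}

# Inner product over the tokens the two vectors share; missing tokens contribute 0.
score = sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())
print(score)  # 2.1*1.5 + 1.8*1.2 = 5.31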
⨠Features
- High Search Relevance: The v2 series of models generally offers better search relevance than the v1 series, as demonstrated by benchmark results on multiple datasets.
- Inference-Free Retrieval: Document-side ("doc") models such as this one encode queries with only a tokenizer and an IDF look-up table, so no model inference is needed at search time, which significantly improves retrieval efficiency.
- Efficient Encoding: Documents are encoded into sparse vectors, reducing storage and computational requirements.
📦 Model Selection
When selecting a model, consider search relevance, model inference requirements, and retrieval efficiency (FLOPS). We benchmarked the models' zero-shot performance on a subset of the BEIR benchmark, including TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.
Overall, the v2 series of models have better search relevance, efficiency, and inference speed than the v1 series. However, the specific advantages and disadvantages may vary across different datasets.
💻 Usage Examples
Basic Usage
This model is designed to run inside an OpenSearch cluster, but you can also use it outside the cluster with the Hugging Face Transformers API:
import itertools
import json

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


def get_sparse_vector(feature, output):
    # Pool token-level MLM logits into one sparse vector per document:
    # max-pool over the sequence dimension, masking out padding positions.
    values, _ = torch.max(output * feature["attention_mask"].unsqueeze(-1), dim=1)
    # log(1 + ReLU(x)) keeps weights non-negative and dampens large activations.
    values = torch.log(1 + torch.relu(values))
    # Zero out special tokens ([CLS], [SEP], ...) so they never contribute.
    values[:, get_sparse_vector.special_token_ids] = 0
    return values


def transform_sparse_vector_to_dict(sparse_vector):
    # Convert a batch of sparse vectors into token-string -> weight dictionaries.
    sample_indices, token_indices = torch.nonzero(sparse_vector, as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices, token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0] + number_of_tokens_for_each_sample))
    for i in range(len(end_idxs) - 1):
        token_strings = tokens[end_idxs[i]:end_idxs[i + 1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i + 1]]
        output.append(dict(zip(token_strings, weights)))
    return output


def get_tokenizer_idf(tokenizer):
    # Download the IDF weight look-up table and align it with the tokenizer vocab.
    from huggingface_hub import hf_hub_download
    local_cached_path = hf_hub_download(
        repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill",
        filename="idf.json",
    )
    with open(local_cached_path) as f:
        idf = json.load(f)
    idf_vector = [0] * tokenizer.vocab_size
    for token, weight in idf.items():
        _id = tokenizer._convert_token_to_id_with_added_voc(token)
        idf_vector[_id] = weight
    return torch.tensor(idf_vector)


# Load the document encoder, its tokenizer, and the IDF look-up table.
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill")
idf = get_tokenizer_idf(tokenizer)

# Record the special-token ids so pooling can exclude them.
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids

# Build an id -> token-string table for pretty-printing sparse vectors.
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token

query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# Encode the query without model inference: tokenize, then weight by IDF.
feature_query = tokenizer([query], padding=True, truncation=True, return_tensors="pt")
input_ids = feature_query["input_ids"]
batch_size = input_ids.shape[0]
query_vector = torch.zeros(batch_size, tokenizer.vocab_size)
query_vector[torch.arange(batch_size).unsqueeze(-1), input_ids] = 1
query_sparse_vector = query_vector * idf

# Encode the document with the model (MLM logits, then sparse pooling).
feature_document = tokenizer([document], padding=True, truncation=True, return_tensors="pt")
output = model(**feature_document)[0]
document_sparse_vector = get_sparse_vector(feature_document, output)

# The similarity score is the inner product of the two sparse vectors.
sim_score = torch.matmul(query_sparse_vector[0], document_sparse_vector[0])
print(sim_score)

# Inspect which expanded tokens the query and document have in common.
query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
document_token_weight = transform_sparse_vector_to_dict(document_sparse_vector)[0]
for token in sorted(query_token_weight, key=lambda x: query_token_weight[x], reverse=True):
    if token in document_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"
              % (query_token_weight[token], document_token_weight[token], token))
The above code sample shows an example of neural sparse search. Although the original query and document share no overlapping tokens, the model still produces a good match.
📚 Documentation
Overview
The training datasets include MS MARCO, eli5_question_answer, squad_pairs, WikiAnswers, yahoo_answers_title_question, gooaq_pairs, stackexchange_duplicate_questions_body_body, wikihow, S2ORC_title_abstract, stackexchange_duplicate_questions_title-body_title-body, yahoo_answers_question_answer, searchQA_top5_snippets, stackexchange_duplicate_questions_title_title, yahoo_answers_title_answer.
The OpenSearch neural sparse feature supports learned sparse retrieval on top of the Lucene inverted index (see https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/). Indexing and search can be performed with the OpenSearch high-level API.
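For illustration, here is a minimal sketch of a neural sparse query issued through the opensearch-py client. The index name, field name, and model id are placeholders: the sketch assumes a cluster where this model is already registered and deployed, and where documents were ingested into a rank_features field through a sparse_encoding ingest pipeline.

# Minimal sketch, assuming a running cluster with the model already deployed.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

response = client.search(
    index="my-nlp-index",  # placeholder index name
    body={
        "query": {
            "neural_sparse": {
                "passage_embedding": {  # placeholder rank_features field
                    "query_text": "What's the weather in ny now?",
                    "model_id": "<model id registered in the cluster>",
                }
            }
        }
    },
)
print(response["hits"]["hits"])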
📄 License
This project is licensed under the Apache v2.0 License.
Copyright
Copyright OpenSearch Contributors. See NOTICE for details.