opensearch-neural-sparse-encoding-doc-v3-distill Open Source Model - Empowering Efficient Document Retrieval in OpenSearch

Opensearch Neural Sparse Encoding Doc V3 Distill

Developed by opensearch-project

A document-level learned sparse retrieval model specifically designed for OpenSearch, optimized with distillation technology to support efficient document retrieval

Text Embedding

Transformers

EnglishOpen Source License:Apache-2.0 #Inference-free Sparse Retrieval #Document Expansion Specialized #Efficient Passage Search

Downloads 243

Release Time : 3/28/2025

Model Overview

This model encodes documents into 30522-dimensional sparse vectors, suitable for document retrieval tasks, with special optimizations for retrieval efficiency in OpenSearch

Model Features

Inference-free Retrieval

Document processing requires no inference computation, directly generating sparse vectors to significantly reduce computational costs

Efficient Retrieval

Average FLOPS only 1.8, significantly improving efficiency compared to previous generation models

Improved Relevance

Achieves average NDCG@10 of 0.517 on BEIR benchmark, outperforming previous document-specialized models

Large-scale Training

Trained on diverse QA and document datasets including MS MARCO, WikiAnswers, etc.

Model Capabilities

Document Retrieval

Sparse Vector Generation

Semantic Matching

Cross-domain Retrieval

Use Cases

Search Engine

OpenSearch Document Retrieval

Serves as OpenSearch's neural sparse retrieval component, providing efficient document search capabilities

Delivers better semantic matching effects compared to traditional retrieval methods

QA Systems

QA Pair Retrieval

Retrieves the most relevant QA pairs from knowledge base for user questions

Performs well on QA datasets like NQ

🚀 opensearch-neural-sparse-encoding-doc-v3-distill

This is a learned sparse retrieval model. It encodes documents into 30522-dimensional sparse vectors and uses a tokenizer and a weight look-up table to generate sparse vectors for queries. The similarity score is calculated as the inner product of query and document sparse vectors.

🚀 Quick Start

When selecting a model, you should consider search relevance, model inference, and retrieval efficiency (FLOPS). We benchmarked the models' performance on a subset of the BEIR benchmark, including TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.

Overall, the v3 series of models offer better search relevance, efficiency, and inference speed compared to the v1 and v2 series. However, the specific advantages and disadvantages may vary across different datasets.

Model	Inference-free for Retrieval	Model Parameters	AVG NDCG@10	AVG FLOPS
opensearch-neural-sparse-encoding-v1		133M	0.524	11.4
opensearch-neural-sparse-encoding-v2-distill		67M	0.528	8.3
opensearch-neural-sparse-encoding-doc-v1	✔️	133M	0.490	2.3
opensearch-neural-sparse-encoding-doc-v2-distill	✔️	67M	0.504	1.8
opensearch-neural-sparse-encoding-doc-v2-mini	✔️	23M	0.497	1.7
opensearch-neural-sparse-encoding-doc-v3-distill	✔️	67M	0.517	1.8

✨ Features

Learned Sparse Retrieval: Supports learned sparse retrieval with the Lucene inverted index through OpenSearch neural sparse feature.
Diverse Training Datasets: Trained on a wide range of datasets, including MS MARCO, eli5_question_answer, squad_pairs, etc.
OpenSearch Compatibility: Can be used within an OpenSearch cluster or outside with HuggingFace models API.

💻 Usage Examples

Basic Usage

This model is intended to run inside an OpenSearch cluster, but you can also use it outside the cluster with HuggingFace models API.

import json
import itertools
import torch

from transformers import AutoModelForMaskedLM, AutoTokenizer


# get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
def get_sparse_vector(feature, output):
    values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
    # note we update the activation for v3 model
    values = torch.log(1 + torch.log(1 + torch.relu(values)))
    values[:,special_token_ids] = 0
    return values
    
# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
    for i in range(len(end_idxs)-1):
        token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
        output.append(dict(zip(token_strings, weights)))
    return output
    
# download the idf file from model hub. idf is used to give weights for query tokens
def get_tokenizer_idf(tokenizer):
    from huggingface_hub import hf_hub_download
    local_cached_path = hf_hub_download(repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill", filename="idf.json")
    with open(local_cached_path) as f:
        idf = json.load(f)
    idf_vector = [0]*tokenizer.vocab_size
    for token,weight in idf.items():
        _id = tokenizer._convert_token_to_id_with_added_voc(token)
        idf_vector[_id]=weight
    return torch.tensor(idf_vector)

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill")
idf = get_tokenizer_idf(tokenizer)

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token



query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query
feature_query = tokenizer([query], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
input_ids = feature_query["input_ids"]
batch_size = input_ids.shape[0]
query_vector = torch.zeros(batch_size, tokenizer.vocab_size)
query_vector[torch.arange(batch_size).unsqueeze(-1), input_ids] = 1
query_sparse_vector = query_vector*idf

# encode the document
feature_document = tokenizer([document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
output = model(**feature_document)[0]
document_sparse_vector = get_sparse_vector(feature_document, output)


# get similarity score
sim_score = torch.matmul(query_sparse_vector[0],document_sparse_vector[0])
print(sim_score)   # tensor(11.1105, grad_fn=<DotBackward0>)


query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
document_query_token_weight = transform_sparse_vector_to_dict(document_sparse_vector)[0]
for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
        

        
# result:
# score in query: 5.7729, score in document: 0.8049, token: ny
# score in query: 4.5684, score in document: 0.9710, token: weather
# score in query: 3.5895, score in document: 0.4720, token: now
# score in query: 3.3313, score in document: 0.0286, token: ?
# score in query: 2.7699, score in document: 0.0787, token: what
# score in query: 0.4989, score in document: 0.0417, token: in

The above code sample demonstrates an example of neural sparse search. Even though there are no overlapping tokens in the original query and document, the model performs a good match.

📚 Documentation

Detailed Search Relevance

The following table shows the detailed search relevance performance of different models on various datasets:

Model	Average	Trec Covid	NFCorpus	NQ	HotpotQA	FiQA	ArguAna	Touche	DBPedia	SCIDOCS	FEVER	Climate FEVER	SciFact	Quora
opensearch-neural-sparse-encoding-v1	0.524	0.771	0.360	0.553	0.697	0.376	0.508	0.278	0.447	0.164	0.821	0.263	0.723	0.856
opensearch-neural-sparse-encoding-v2-distill	0.528	0.775	0.347	0.561	0.685	0.374	0.551	0.278	0.435	0.173	0.849	0.249	0.722	0.863
opensearch-neural-sparse-encoding-doc-v1	0.490	0.707	0.352	0.521	0.677	0.344	0.461	0.294	0.412	0.154	0.743	0.202	0.716	0.788
opensearch-neural-sparse-encoding-doc-v2-distill	0.504	0.690	0.343	0.528	0.675	0.357	0.496	0.287	0.418	0.166	0.818	0.224	0.715	0.841
opensearch-neural-sparse-encoding-doc-v2-mini	0.497	0.709	0.336	0.510	0.666	0.338	0.480	0.285	0.407	0.164	0.812	0.216	0.699	0.837
opensearch-neural-sparse-encoding-doc-v3-distill	0.517	0.724	0.345	0.544	0.694	0.356	0.520	0.294	0.424	0.163	0.845	0.239	0.708	0.863

Training Datasets

The training datasets include MS MARCO, eli5_question_answer, squad_pairs, WikiAnswers, yahoo_answers_title_question, gooaq_pairs, stackexchange_duplicate_questions_body_body, wikihow, S2ORC_title_abstract, stackexchange_duplicate_questions_title-body_title-body, yahoo_answers_question_answer, searchQA_top5_snippets, stackexchange_duplicate_questions_title_title, yahoo_answers_title_answer, fever, fiqa, hotpotqa, nfcorpus, scifact.

OpenSearch Integration

OpenSearch neural sparse feature supports learned sparse retrieval with the Lucene inverted index. You can perform indexing and search using OpenSearch high-level API. Link: https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/.

📄 License

This project is licensed under the Apache v2.0 License.

Copyright

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご