OpenSearch Neural Sparse Coding Model v2 Distilled Version Open-Sourced - Efficiently Implement Sparse Retrieval of Queries and Documents

Opensearch Neural Sparse Encoding V2 Distill

Developed by opensearch-project

The OpenSearch Neural Sparse Encoding Model v2 Distilled is an efficient learned sparse retrieval model designed for OpenSearch, capable of encoding queries and documents into 30,522-dimensional sparse vectors.

Text Embedding

Transformers

EnglishOpen Source License:Apache-2.0 #Learned Sparse Retrieval #Zero-shot Performance #Efficient Inference

Downloads 4,964

Release Time : 7/17/2024

Model Overview

This model is primarily used for retrieval tasks, converting queries and documents into sparse vectors, supporting sparse retrieval based on Lucene inverted indexes, and is suitable for various information retrieval scenarios.

Model Features

Efficient Sparse Retrieval

Supports sparse retrieval based on Lucene inverted indexes, improving retrieval efficiency.

Distilled Optimization

Compared to the base model, the parameter count is reduced by half while maintaining or improving performance.

Multi-dataset Training

Training data includes 14 public datasets such as MS MARCO, eli5 Q&A, and squad Q&A pairs.

Semantic Relevance Matching

Even when original texts share no overlapping words, the model can still achieve effective matching through semantic relevance.

Model Capabilities

Text Retrieval

Query Expansion

Document Expansion

Semantic Matching

Use Cases

Information Retrieval

Document Retrieval

Quickly retrieve relevant documents from large document libraries.

Achieves an average NDCG@10 of 0.528 on the BEIR benchmark subset

Q&A System

Used for relevant passage retrieval in Q&A systems.

Achieves an NDCG@10 of 0.561 on the NQ (Natural Questions) dataset

Search Engine

OpenSearch Integration

Serves as the core component for neural sparse retrieval functionality in OpenSearch.

Supports efficient retrieval based on Lucene inverted indexes

🚀 opensearch-neural-sparse-encoding-v2-distill

This is a learned sparse retrieval model that encodes queries and documents into 30522-dimensional sparse vectors. It offers better search relevance, efficiency, and inference speed compared to some previous models, with specific performance varying across different datasets.

🚀 Quick Start

Select the model

When choosing a model, you should consider search relevance, model inference, and retrieval efficiency (FLOPS). We benchmarked the models' zero-shot performance on a subset of the BEIR benchmark, including TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.

Overall, the v2 series of models outperform the v1 series in terms of search relevance, efficiency, and inference speed. However, the specific advantages and disadvantages may vary depending on the dataset.

Model	Inference-free for Retrieval	Model Parameters	AVG NDCG@10	AVG FLOPS
opensearch-neural-sparse-encoding-v1		133M	0.524	11.4
opensearch-neural-sparse-encoding-v2-distill		67M	0.528	8.3
opensearch-neural-sparse-encoding-doc-v1	✔️	133M	0.490	2.3
opensearch-neural-sparse-encoding-doc-v2-distill	✔️	67M	0.504	1.8
opensearch-neural-sparse-encoding-doc-v2-mini	✔️	23M	0.497	1.7

✨ Features

Overview

Paper: Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers
Fine-tuning sample: opensearch-sparse-model-tuning-sample

This learned sparse retrieval model encodes queries and documents into 30522-dimensional sparse vectors. The non-zero dimension index represents the corresponding token in the vocabulary, and the weight indicates the importance of the token.

The training datasets include MS MARCO, eli5_question_answer, squad_pairs, WikiAnswers, yahoo_answers_title_question, gooaq_pairs, stackexchange_duplicate_questions_body_body, wikihow, S2ORC_title_abstract, stackexchange_duplicate_questions_title-body_title-body, yahoo_answers_question_answer, searchQA_top5_snippets, stackexchange_duplicate_questions_title_title, and yahoo_answers_title_answer.

OpenSearch neural sparse feature supports learned sparse retrieval with the lucene inverted index. Link: https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/. You can perform indexing and search using the OpenSearch high-level API.

💻 Usage Examples

Basic Usage

This model is designed to run within an OpenSearch cluster. However, you can also use it outside the cluster with the HuggingFace models API.

import itertools
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


# get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
def get_sparse_vector(feature, output):
    values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
    values = torch.log(1 + torch.relu(values))
    values[:,special_token_ids] = 0
    return values
    
# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
    for i in range(len(end_idxs)-1):
        token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
        output.append(dict(zip(token_strings, weights)))
    return output
    

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v2-distill")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v2-distill")

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token



query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query & document
feature = tokenizer([query, document], padding=True, truncation=True, return_tensors='pt')
output = model(**feature)[0]
sparse_vector = get_sparse_vector(feature, output)

# get similarity score
sim_score = torch.matmul(sparse_vector[0],sparse_vector[1])
print(sim_score)   # tensor(38.6112, grad_fn=<DotBackward0>)


query_token_weight, document_query_token_weight = transform_sparse_vector_to_dict(sparse_vector)
for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
        

        
# result:
# score in query: 2.7273, score in document: 2.9088, token: york
# score in query: 2.5734, score in document: 0.9208, token: now
# score in query: 2.3895, score in document: 1.7237, token: ny
# score in query: 2.2184, score in document: 1.2368, token: weather
# score in query: 1.8693, score in document: 1.4146, token: current
# score in query: 1.5887, score in document: 0.7450, token: today
# score in query: 1.4704, score in document: 0.9247, token: sunny
# score in query: 1.4374, score in document: 1.9737, token: nyc
# score in query: 1.4347, score in document: 1.6019, token: currently
# score in query: 1.1605, score in document: 0.9794, token: climate
# score in query: 1.0944, score in document: 0.7141, token: upstate
# score in query: 1.0471, score in document: 0.5519, token: forecast
# score in query: 0.9268, score in document: 0.6692, token: verve
# score in query: 0.9126, score in document: 0.4486, token: huh
# score in query: 0.8960, score in document: 0.7706, token: greene
# score in query: 0.8779, score in document: 0.7120, token: picturesque
# score in query: 0.8471, score in document: 0.4183, token: pleasantly
# score in query: 0.8079, score in document: 0.2140, token: windy
# score in query: 0.7537, score in document: 0.4925, token: favorable
# score in query: 0.7519, score in document: 2.1456, token: rain
# score in query: 0.7277, score in document: 0.3818, token: skies
# score in query: 0.6995, score in document: 0.8593, token: lena
# score in query: 0.6895, score in document: 0.2410, token: sunshine
# score in query: 0.6621, score in document: 0.3016, token: johnny
# score in query: 0.6604, score in document: 0.1933, token: skyline
# score in query: 0.6117, score in document: 0.2197, token: sasha
# score in query: 0.5962, score in document: 0.0414, token: vibe
# score in query: 0.5381, score in document: 0.7560, token: hardly
# score in query: 0.4582, score in document: 0.4243, token: prevailing
# score in query: 0.4539, score in document: 0.5073, token: unpredictable
# score in query: 0.4350, score in document: 0.8463, token: presently
# score in query: 0.3674, score in document: 0.2496, token: hail
# score in query: 0.3324, score in document: 0.5506, token: shivered
# score in query: 0.3281, score in document: 0.1964, token: wind
# score in query: 0.3052, score in document: 0.5785, token: rudy
# score in query: 0.2797, score in document: 0.0357, token: looming
# score in query: 0.2712, score in document: 0.0870, token: atmospheric
# score in query: 0.2471, score in document: 0.3490, token: vicky
# score in query: 0.2247, score in document: 0.2383, token: sandy
# score in query: 0.2154, score in document: 0.5737, token: crowded
# score in query: 0.1723, score in document: 0.1857, token: chilly
# score in query: 0.1700, score in document: 0.4110, token: blizzard
# score in query: 0.1183, score in document: 0.0613, token: ##cken
# score in query: 0.0923, score in document: 0.6363, token: unrest
# score in query: 0.0624, score in document: 0.2127, token: russ
# score in query: 0.0558, score in document: 0.5542, token: blackout
# score in query: 0.0549, score in document: 0.1589, token: kahn
# score in query: 0.0160, score in document: 0.0566, token: 2020
# score in query: 0.0125, score in document: 0.3753, token: nighttime

The above code sample demonstrates an example of neural sparse search. Even though there are no overlapping tokens in the original query and document, the model still achieves a good match.

📚 Documentation

Detailed Search Relevance

Model	Average	Trec Covid	NFCorpus	NQ	HotpotQA	FiQA	ArguAna	Touche	DBPedia	SCIDOCS	FEVER	Climate FEVER	SciFact	Quora
opensearch-neural-sparse-encoding-v1	0.524	0.771	0.360	0.553	0.697	0.376	0.508	0.278	0.447	0.164	0.821	0.263	0.723	0.856
opensearch-neural-sparse-encoding-v2-distill	0.528	0.775	0.347	0.561	0.685	0.374	0.551	0.278	0.435	0.173	0.849	0.249	0.722	0.863
opensearch-neural-sparse-encoding-doc-v1	0.490	0.707	0.352	0.521	0.677	0.344	0.461	0.294	0.412	0.154	0.743	0.202	0.716	0.788
opensearch-neural-sparse-encoding-doc-v2-distill	0.504	0.690	0.343	0.528	0.675	0.357	0.496	0.287	0.418	0.166	0.818	0.224	0.715	0.841
opensearch-neural-sparse-encoding-doc-v2-mini	0.497	0.709	0.336	0.510	0.666	0.338	0.480	0.285	0.407	0.164	0.812	0.216	0.699	0.837

📄 License

This project is licensed under the Apache v2.0 License.

Copyright

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご