opensearch-neural-sparse-encoding-doc-v2-mini Open-source Search Model

Opensearch Neural Sparse Encoding Doc V2 Mini

Developed by opensearch-project

OpenSearch's learned sparse retrieval model v2 mini version, encoding documents into sparse vectors to optimize search relevance and efficiency

Text Embedding

Transformers

EnglishOpen Source License:Apache-2.0 #Learned Sparse Retrieval #Efficient Document Expansion #Zero-shot Performance

Downloads 113

Release Time : 7/18/2024

Model Overview

This is a learned sparse retrieval model specifically designed for OpenSearch. It encodes documents into 30522-dimensional sparse vectors and calculates similarity scores through the inner product of query and document sparse vectors. Compared to the v1 series, the v2 series shows improvements in search relevance, efficiency, and inference speed.

Model Features

Efficient Sparse Encoding

Encodes documents into 30522-dimensional sparse vectors, optimizing storage and retrieval efficiency

Inference-Free Retrieval

No model inference required during retrieval; directly uses pre-computed sparse vectors

Performance Optimization

Compared to the v1 series, the v2 series shows improvements in search relevance and inference speed

OpenSearch Integration

Designed specifically for OpenSearch, supporting retrieval based on Lucene inverted indexes

Model Capabilities

Document Sparse Encoding

Efficient Similarity Calculation

Large-scale Document Retrieval

Zero-shot Retrieval

Use Cases

Information Retrieval

Document Search

Quickly retrieves relevant content from large-scale document collections

Achieves an average NDCG@10 of 0.497 on BEIR benchmark subsets

Question Answering Systems

Serves as the retrieval component for QA systems to quickly find relevant passages

NDCG@10 of 0.510 on the NQ (Natural Questions) dataset

Enterprise Search

Internal Document Retrieval

Helps enterprises quickly search internal knowledge bases and documents

🚀 opensearch-neural-sparse-encoding-doc-v2-mini

This project offers a learned sparse retrieval model, which encodes documents into sparse vectors for efficient search. It supports retrieval with OpenSearch and provides good search relevance and efficiency.

🚀 Quick Start

When selecting a model, search relevance, model inference, and retrieval efficiency (FLOPS) should be considered. We benchmarked the models' zero-shot performance on a subset of the BEIR benchmark, including TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.

Overall, the v2 series of models have better search relevance, efficiency, and inference speed than the v1 series. However, the specific advantages and disadvantages may vary across different datasets.

Model	Inference-free for Retrieval	Model Parameters	AVG NDCG@10	AVG FLOPS
opensearch-neural-sparse-encoding-v1		133M	0.524	11.4
opensearch-neural-sparse-encoding-v2-distill		67M	0.528	8.3
opensearch-neural-sparse-encoding-doc-v1	✔️	133M	0.490	2.3
opensearch-neural-sparse-encoding-doc-v2-distill	✔️	67M	0.504	1.8
opensearch-neural-sparse-encoding-doc-v2-mini	✔️	23M	0.497	1.7

✨ Features

Paper: Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers
Fine-tuning sample: opensearch-sparse-model-tuning-sample

This is a learned sparse retrieval model that encodes documents into 30522-dimensional sparse vectors. For queries, it uses a tokenizer and a weight look-up table to generate sparse vectors. The non-zero dimension index represents the corresponding token in the vocabulary, and the weight indicates the importance of the token. The similarity score is the inner product of the query and document sparse vectors.

The training datasets include MS MARCO, eli5_question_answer, squad_pairs, WikiAnswers, yahoo_answers_title_question, gooaq_pairs, stackexchange_duplicate_questions_body_body, wikihow, S2ORC_title_abstract, stackexchange_duplicate_questions_title-body_title-body, yahoo_answers_question_answer, searchQA_top5_snippets, stackexchange_duplicate_questions_title_title, and yahoo_answers_title_answer.

OpenSearch neural sparse feature supports learned sparse retrieval with the lucene inverted index. Link: https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/. Indexing and search can be performed using the OpenSearch high-level API.

💻 Usage Examples

Basic Usage

import json
import itertools
import torch

from transformers import AutoModelForMaskedLM, AutoTokenizer


# get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
def get_sparse_vector(feature, output):
    values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
    values = torch.log(1 + torch.relu(values))
    values[:,special_token_ids] = 0
    return values
    
# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
    for i in range(len(end_idxs)-1):
        token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
        output.append(dict(zip(token_strings, weights)))
    return output
    
# download the idf file from model hub. idf is used to give weights for query tokens
def get_tokenizer_idf(tokenizer):
    from huggingface_hub import hf_hub_download
    local_cached_path = hf_hub_download(repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini", filename="idf.json")
    with open(local_cached_path) as f:
        idf = json.load(f)
    idf_vector = [0]*tokenizer.vocab_size
    for token,weight in idf.items():
        _id = tokenizer._convert_token_to_id_with_added_voc(token)
        idf_vector[_id]=weight
    return torch.tensor(idf_vector)

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini")
idf = get_tokenizer_idf(tokenizer)

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token



query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query
feature_query = tokenizer([query], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
input_ids = feature_query["input_ids"]
batch_size = input_ids.shape[0]
query_vector = torch.zeros(batch_size, tokenizer.vocab_size)
query_vector[torch.arange(batch_size).unsqueeze(-1), input_ids] = 1
query_sparse_vector = query_vector*idf

# encode the document
feature_document = tokenizer([document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
output = model(**feature_document)[0]
document_sparse_vector = get_sparse_vector(feature_document, output)


# get similarity score
sim_score = torch.matmul(query_sparse_vector[0],document_sparse_vector[0])
print(sim_score)   # tensor(13.8344, grad_fn=<DotBackward0>)


query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
document_query_token_weight = transform_sparse_vector_to_dict(document_sparse_vector)[0]
for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
        

        
# result:
# score in query: 5.7729, score in document: 1.0251, token: ny
# score in query: 4.5684, score in document: 1.1145, token: weather
# score in query: 3.5895, score in document: 0.5356, token: now
# score in query: 3.3313, score in document: 0.2710, token: ?

The above code sample shows an example of neural sparse search. Although there is no overlap token in the original query and document, this model performs a good match.

📚 Documentation

Detailed Search Relevance

Model	Average	Trec Covid	NFCorpus	NQ	HotpotQA	FiQA	ArguAna	Touche	DBPedia	SCIDOCS	FEVER	Climate FEVER	SciFact	Quora
opensearch-neural-sparse-encoding-v1	0.524	0.771	0.360	0.553	0.697	0.376	0.508	0.278	0.447	0.164	0.821	0.263	0.723	0.856
opensearch-neural-sparse-encoding-v2-distill	0.528	0.775	0.347	0.561	0.685	0.374	0.551	0.278	0.435	0.173	0.849	0.249	0.722	0.863
opensearch-neural-sparse-encoding-doc-v1	0.490	0.707	0.352	0.521	0.677	0.344	0.461	0.294	0.412	0.154	0.743	0.202	0.716	0.788
opensearch-neural-sparse-encoding-doc-v2-distill	0.504	0.690	0.343	0.528	0.675	0.357	0.496	0.287	0.418	0.166	0.818	0.224	0.715	0.841
opensearch-neural-sparse-encoding-doc-v2-mini	0.497	0.709	0.336	0.510	0.666	0.338	0.480	0.285	0.407	0.164	0.812	0.216	0.699	0.837

📄 License

This project is licensed under the Apache v2.0 License.

Copyright

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご