opensearch-neural-sparse-encoding-doc-v2-distill
This is a learned sparse retrieval model that encodes documents into 30522-dimensional sparse vectors, enabling efficient retrieval with OpenSearch.
🚀 Quick Start
This model is designed for learned sparse retrieval. It encodes documents into 30522-dimensional sparse vectors. For queries, it uses only a tokenizer and a weight look-up table to generate sparse vectors, so no model inference is needed at query time. The similarity score is the inner product of the query and document sparse vectors.
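To make the scoring step concrete, here is a minimal sketch of the inner-product computation, assuming the query and document have already been encoded into token-to-weight dictionaries (the dictionaries and weights below are illustrative, not real model output):

# Sparse vectors represented as token -> weight dictionaries (illustrative values).
query_weights = {"weather": 2.1, "ny": 1.8, "now": 0.9}
doc_weights = {"weather": 1.5, "ny": 1.2, "rain": 1.1}

# Inner product over the tokens the two vectors share; missing tokens contribute 0.
score = sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())
print(score)  # 2.1*1.5 + 1.8*1.2 = 5.31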
⨠Features
- High Search Relevance: The v2 series of models generally offers better search relevance than the v1 series, as demonstrated by benchmark results on multiple datasets.
- Inference-Free Retrieval: Document-side ("doc") models such as this one encode queries with only a tokenizer and an IDF look-up table, so no model inference is needed at search time, which significantly improves retrieval efficiency.
- Efficient Encoding: Documents are encoded into sparse vectors, reducing storage and computational requirements.
📦 Model Selection
When selecting a model, consider search relevance, model inference requirements, and retrieval efficiency (FLOPS). We benchmarked the models' zero-shot performance on a subset of the BEIR benchmark, including TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.
Overall, the v2 series of models have better search relevance, efficiency, and inference speed than the v1 series. However, the specific advantages and disadvantages may vary across different datasets.
💻 Usage Examples
Basic Usage
This model is designed to run inside an OpenSearch cluster, but you can also use it outside the cluster with the Hugging Face Transformers API:
import itertools
import json

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


def get_sparse_vector(feature, output):
    # Pool token-level MLM logits into one sparse vector per document:
    # max-pool over the sequence dimension, masking out padding positions.
    values, _ = torch.max(output * feature["attention_mask"].unsqueeze(-1), dim=1)
    # log(1 + ReLU(x)) keeps weights non-negative and dampens large activations.
    values = torch.log(1 + torch.relu(values))
    # Zero out special tokens ([CLS], [SEP], ...) so they never contribute.
    values[:, get_sparse_vector.special_token_ids] = 0
    return values


def transform_sparse_vector_to_dict(sparse_vector):
    # Convert a batch of sparse vectors into token-string -> weight dictionaries.
    sample_indices, token_indices = torch.nonzero(sparse_vector, as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices, token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0] + number_of_tokens_for_each_sample))
    for i in range(len(end_idxs) - 1):
        token_strings = tokens[end_idxs[i]:end_idxs[i + 1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i + 1]]
        output.append(dict(zip(token_strings, weights)))
    return output


def get_tokenizer_idf(tokenizer):
    # Download the IDF weight look-up table and align it with the tokenizer vocab.
    from huggingface_hub import hf_hub_download
    local_cached_path = hf_hub_download(
        repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill",
        filename="idf.json",
    )
    with open(local_cached_path) as f:
        idf = json.load(f)
    idf_vector = [0] * tokenizer.vocab_size
    for token, weight in idf.items():
        _id = tokenizer._convert_token_to_id_with_added_voc(token)
        idf_vector[_id] = weight
    return torch.tensor(idf_vector)


# Load the document encoder, its tokenizer, and the IDF look-up table.
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill")
idf = get_tokenizer_idf(tokenizer)

# Record the special-token ids so pooling can exclude them.
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids

# Build an id -> token-string table for pretty-printing sparse vectors.
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token

query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# Encode the query without model inference: tokenize, then weight by IDF.
feature_query = tokenizer([query], padding=True, truncation=True, return_tensors="pt")
input_ids = feature_query["input_ids"]
batch_size = input_ids.shape[0]
query_vector = torch.zeros(batch_size, tokenizer.vocab_size)
query_vector[torch.arange(batch_size).unsqueeze(-1), input_ids] = 1
query_sparse_vector = query_vector * idf

# Encode the document with the model (MLM logits, then sparse pooling).
feature_document = tokenizer([document], padding=True, truncation=True, return_tensors="pt")
output = model(**feature_document)[0]
document_sparse_vector = get_sparse_vector(feature_document, output)

# The similarity score is the inner product of the two sparse vectors.
sim_score = torch.matmul(query_sparse_vector[0], document_sparse_vector[0])
print(sim_score)

# Inspect which expanded tokens the query and document have in common.
query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
document_token_weight = transform_sparse_vector_to_dict(document_sparse_vector)[0]
for token in sorted(query_token_weight, key=lambda x: query_token_weight[x], reverse=True):
    if token in document_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"
              % (query_token_weight[token], document_token_weight[token], token))
The above code sample shows an example of neural sparse search. Although the original query and document share no overlapping tokens, the model still produces a good match.
📚 Documentation
Overview
The training datasets include MS MARCO, eli5_question_answer, squad_pairs, WikiAnswers, yahoo_answers_title_question, gooaq_pairs, stackexchange_duplicate_questions_body_body, wikihow, S2ORC_title_abstract, stackexchange_duplicate_questions_title-body_title-body, yahoo_answers_question_answer, searchQA_top5_snippets, stackexchange_duplicate_questions_title_title, yahoo_answers_title_answer.
The OpenSearch neural sparse feature supports learned sparse retrieval on top of the Lucene inverted index (see https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/). Indexing and search can be performed with the OpenSearch high-level API.
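For illustration, here is a minimal sketch of a neural sparse query issued through the opensearch-py client. The index name, field name, and model id are placeholders: the sketch assumes a cluster where this model is already registered and deployed, and where documents were ingested into a rank_features field through a sparse_encoding ingest pipeline.

# Minimal sketch, assuming a running cluster with the model already deployed.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

response = client.search(
    index="my-nlp-index",  # placeholder index name
    body={
        "query": {
            "neural_sparse": {
                "passage_embedding": {  # placeholder rank_features field
                    "query_text": "What's the weather in ny now?",
                    "model_id": "<model id registered in the cluster>",
                }
            }
        }
    },
)
print(response["hits"]["hits"])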
📄 License
This project is licensed under the Apache v2.0 License.
Copyright
Copyright OpenSearch Contributors. See NOTICE for details.