# opensearch-neural-sparse-encoding-doc-v2-mini
This project provides a learned sparse retrieval model that encodes documents into sparse vectors for efficient search. It is designed for retrieval with OpenSearch and delivers strong search relevance and efficiency.
## Quick Start
When selecting a model, consider search relevance, model inference speed, and retrieval efficiency (FLOPS). We benchmarked the models' zero-shot performance on a subset of the BEIR benchmark: TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.
Overall, the v2 series offers better search relevance, efficiency, and inference speed than the v1 series, though the specific trade-offs vary across datasets.
## Features
This is a learned sparse retrieval model that encodes documents into 30,522-dimensional sparse vectors (the BERT vocabulary size). For queries, it uses only a tokenizer and a weight look-up table to generate sparse vectors. Each non-zero dimension corresponds to a token in the vocabulary, and its weight indicates the token's importance. The similarity score is the inner product of the query and document sparse vectors.
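For intuition, here is a minimal sketch of that scoring scheme with made-up token weights. The tokens and numbers below are illustrative, not real model output:

```python
# Illustrative only: hand-picked token weights standing in for real model output.
query_vec = {"weather": 2.1, "ny": 1.7, "now": 0.9}              # tokenizer + IDF look-up
doc_vec = {"weather": 1.4, "rain": 1.2, "ny": 0.8, "york": 0.7}  # model-expanded document

# the similarity score is the inner product over the tokens the two vectors share
score = sum(weight * doc_vec[token] for token, weight in query_vec.items() if token in doc_vec)
print(score)  # ~4.3, i.e. 2.1*1.4 + 1.7*0.8
```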
The training datasets include MS MARCO, eli5_question_answer, squad_pairs, WikiAnswers, yahoo_answers_title_question, gooaq_pairs, stackexchange_duplicate_questions_body_body, wikihow, S2ORC_title_abstract, stackexchange_duplicate_questions_title-body_title-body, yahoo_answers_question_answer, searchQA_top5_snippets, stackexchange_duplicate_questions_title_title, and yahoo_answers_title_answer.
The OpenSearch neural sparse feature supports learned sparse retrieval on the Lucene inverted index (see https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/). Indexing and search can be performed with the OpenSearch high-level API; a hedged client sketch follows.
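As a sketch (not an official snippet), indexing this model's `{token: weight}` output and querying it with `opensearch-py` might look as follows. The index name, field names, and the raw-vector `query_tokens` form of the `neural_sparse` query are assumptions; verify them against the linked documentation for your OpenSearch version:

```python
# A sketch, not an official example: assumes a local OpenSearch cluster with the
# neural search plugin enabled, and that your version's neural_sparse query
# accepts raw query_tokens. Index and field names are made up for illustration.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# a rank_features field stores a sparse {token: weight} map in the inverted index
client.indices.create(index="my-sparse-index", body={
    "mappings": {"properties": {
        "passage_text": {"type": "text"},
        "passage_embedding": {"type": "rank_features"},
    }}
})

# doc_tokens would come from transform_sparse_vector_to_dict in the example below
doc_tokens = {"weather": 1.4, "rain": 1.2, "ny": 0.8, "york": 0.7}
client.index(index="my-sparse-index", body={
    "passage_text": "Currently New York is rainy.",
    "passage_embedding": doc_tokens,
}, refresh=True)

# search with a raw sparse query vector (query_text plus model_id is the alternative)
query_tokens = {"weather": 2.1, "ny": 1.7, "now": 0.9}
response = client.search(index="my-sparse-index", body={
    "query": {"neural_sparse": {"passage_embedding": {"query_tokens": query_tokens}}}
})
print(response["hits"]["hits"])
```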
## Usage Examples
### Basic Usage
```python
import json
import itertools
import torch

from transformers import AutoModelForMaskedLM, AutoTokenizer


# pool token-level logits into one sparse vector per document: max over the
# sequence dimension, then log(1 + relu) saturation, with special tokens zeroed
def get_sparse_vector(feature, output):
    values, _ = torch.max(output * feature["attention_mask"].unsqueeze(-1), dim=1)
    values = torch.log(1 + torch.relu(values))
    values[:, get_sparse_vector.special_token_ids] = 0
    return values


# transform the batched sparse vectors into a list of {token: weight} dicts
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices, token_indices = torch.nonzero(sparse_vector, as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices, token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0] + number_of_tokens_for_each_sample))
    for i in range(len(end_idxs) - 1):
        token_strings = tokens[end_idxs[i]:end_idxs[i + 1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i + 1]]
        output.append(dict(zip(token_strings, weights)))
    return output


# download the IDF file from the model hub; a token's IDF weight serves as its
# query-side importance
def get_tokenizer_idf(tokenizer):
    from huggingface_hub import hf_hub_download
    local_cached_path = hf_hub_download(repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini", filename="idf.json")
    with open(local_cached_path) as f:
        idf = json.load(f)
    idf_vector = [0] * tokenizer.vocab_size
    for token, weight in idf.items():
        _id = tokenizer._convert_token_to_id_with_added_voc(token)
        idf_vector[_id] = weight
    return torch.tensor(idf_vector)


# load the model, tokenizer, and IDF look-up table
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini")
idf = get_tokenizer_idf(tokenizer)

# register the special token ids and the id-to-token table used in post-processing
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token

query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query: a binary bag-of-tokens vector scaled by the IDF look-up table
feature_query = tokenizer([query], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
input_ids = feature_query["input_ids"]
batch_size = input_ids.shape[0]
query_vector = torch.zeros(batch_size, tokenizer.vocab_size)
query_vector[torch.arange(batch_size).unsqueeze(-1), input_ids] = 1
query_sparse_vector = query_vector * idf

# encode the document with the model
feature_document = tokenizer([document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
output = model(**feature_document)[0]
document_sparse_vector = get_sparse_vector(feature_document, output)

# the similarity score is the inner product of the two sparse vectors
sim_score = torch.matmul(query_sparse_vector[0], document_sparse_vector[0])
print(sim_score)

# inspect the tokens matched on both sides and their weights
query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
document_token_weight = transform_sparse_vector_to_dict(document_sparse_vector)[0]
for token in sorted(query_token_weight, key=lambda x: query_token_weight[x], reverse=True):
    if token in document_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s" % (query_token_weight[token], document_token_weight[token], token))
```
The code above demonstrates neural sparse search end to end. Although the original query and document share no overlapping tokens, the model still produces a strong match by expanding both into related tokens.
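To rank several documents against the same query, the document encoder can be run as a batch. A small follow-up sketch reusing the objects defined above (the extra documents are illustrative):

```python
# Batch variant of the example above: encode several documents at once and
# score each against the same query vector with an inner product.
docs = [
    "Currently New York is rainy.",
    "The capital of France is Paris.",
]
features = tokenizer(docs, padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
doc_vectors = get_sparse_vector(features, model(**features)[0])  # shape: (len(docs), vocab_size)
scores = torch.matmul(doc_vectors, query_sparse_vector[0])       # one score per document
for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print("%.4f  %s" % (score, doc))
```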
## Documentation
### Detailed Search Relevance
## License
This project is licensed under the Apache 2.0 License.
## Copyright
Copyright OpenSearch Contributors. See NOTICE for details.