opensearch-neural-sparse-encoding-doc-v3-distill
This is a learned sparse retrieval model. It encodes documents into 30522-dimensional sparse vectors. For queries, it uses just a tokenizer and a weight look-up table to generate sparse vectors. The similarity score is the inner product of the query and document sparse vectors.
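Concretely, if each sparse vector is viewed as a token-to-weight map, only tokens active in both the query and the document contribute to the score. A minimal sketch of that scoring rule, with made-up example weights:

```python
# Minimal sketch of the scoring rule described above. Sparse vectors are
# represented as {token: weight} dicts; the weights here are illustrative
# only, not real model outputs.
def inner_product(query_vec: dict, doc_vec: dict) -> float:
    # Only tokens present in both vectors contribute to the score.
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

query_vec = {"weather": 2.1, "ny": 1.7, "now": 0.6}
doc_vec = {"weather": 1.4, "rainy": 1.2, "ny": 1.9, "york": 1.1}

print(inner_product(query_vec, doc_vec))  # 2.1*1.4 + 1.7*1.9 = 6.17
```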
Quick Start
When selecting a model, consider search relevance, model inference speed, and retrieval efficiency (FLOPS). We benchmarked the models on a subset of the BEIR benchmark: TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.
Overall, the v3 series offers better search relevance, efficiency, and inference speed than the v1 and v2 series, though the specific advantages and disadvantages vary across datasets.
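The FLOPS figure is a common efficiency proxy in the learned sparse retrieval literature: the expected number of multiply-add operations needed to score one query-document pair, which grows with how densely the vectors activate the vocabulary. A rough sketch of that estimate, assuming batches of sparse vectors shaped (num_samples, vocab_size):

```python
import torch

def estimated_flops(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> float:
    """Rough FLOPS proxy: expected non-zero overlaps per query-document pair.

    Both inputs are batches of sparse vectors with shape (n, vocab_size).
    """
    # Per-dimension activation probability: the fraction of vectors that
    # are non-zero in each vocabulary dimension.
    q_act = (query_vecs != 0).float().mean(dim=0)
    d_act = (doc_vecs != 0).float().mean(dim=0)
    # Under an independence assumption, this is the expected number of
    # dimensions non-zero in both vectors, i.e. multiply-adds per pair.
    return torch.dot(q_act, d_act).item()
```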
Features
- Learned Sparse Retrieval: Supports learned sparse retrieval with the Lucene inverted index through the OpenSearch neural sparse feature.
- Diverse Training Datasets: Trained on a wide range of datasets, including MS MARCO, eli5_question_answer, squad_pairs, and more (see Training Datasets below).
- OpenSearch Compatibility: Can be used inside an OpenSearch cluster or outside it via the HuggingFace models API.
Usage Examples
Basic Usage
This model is intended to run inside an OpenSearch cluster, but you can also use it outside the cluster via the HuggingFace models API:
```python
import itertools
import json

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


# get the sparse vector from dense token logits with shape batch_size * seq_len * vocab_size
def get_sparse_vector(feature, output):
    values, _ = torch.max(output * feature["attention_mask"].unsqueeze(-1), dim=1)
    values = torch.log(1 + torch.log(1 + torch.relu(values)))
    values[:, get_sparse_vector.special_token_ids] = 0
    return values


# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices, token_indices = torch.nonzero(sparse_vector, as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices, token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0] + number_of_tokens_for_each_sample))
    for i in range(len(end_idxs) - 1):
        token_strings = tokens[end_idxs[i]:end_idxs[i + 1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i + 1]]
        output.append(dict(zip(token_strings, weights)))
    return output


# download the IDF file from the model hub; IDF gives the weights for query tokens
def get_tokenizer_idf(tokenizer):
    from huggingface_hub import hf_hub_download

    local_cached_path = hf_hub_download(
        repo_id="opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill",
        filename="idf.json",
    )
    with open(local_cached_path) as f:
        idf = json.load(f)
    idf_vector = [0] * tokenizer.vocab_size
    for token, weight in idf.items():
        _id = tokenizer._convert_token_to_id_with_added_voc(token)
        idf_vector[_id] = weight
    return torch.tensor(idf_vector)


# load the model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill")
idf = get_tokenizer_idf(tokenizer)

# set the special token ids and the id-to-token transform used in post-processing
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for _ in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token

query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query: a binary bag-of-tokens vector scaled by the IDF weights
feature_query = tokenizer([query], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
input_ids = feature_query["input_ids"]
batch_size = input_ids.shape[0]
query_vector = torch.zeros(batch_size, tokenizer.vocab_size)
query_vector[torch.arange(batch_size).unsqueeze(-1), input_ids] = 1
query_sparse_vector = query_vector * idf

# encode the document with the model
feature_document = tokenizer([document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
output = model(**feature_document)[0]
document_sparse_vector = get_sparse_vector(feature_document, output)

# the similarity score is the inner product of the two sparse vectors
sim_score = torch.matmul(query_sparse_vector[0], document_sparse_vector[0])
print(sim_score)

# inspect the matched (token, weight) pairs
query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
document_token_weight = transform_sparse_vector_to_dict(document_sparse_vector)[0]
for token in sorted(query_token_weight, key=lambda x: query_token_weight[x], reverse=True):
    if token in document_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"
              % (query_token_weight[token], document_token_weight[token], token))
```
The code above demonstrates neural sparse search: even though the original query and document share no overlapping tokens, the model still matches them well, because it expands both into the same weighted vocabulary space.
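To see the expansion behind that match, a short follow-up (reusing the variables and functions from the example above) prints the document's highest-weighted expansion tokens:

```python
# Reusing document_sparse_vector and transform_sparse_vector_to_dict from
# the example above: list the ten highest-weighted tokens in the document's
# expansion. Terms that never appear in the document text can still receive
# weight, which is what lets it match a query with no overlapping tokens.
doc_weights = transform_sparse_vector_to_dict(document_sparse_vector)[0]
for token in sorted(doc_weights, key=doc_weights.get, reverse=True)[:10]:
    print("%.4f  %s" % (doc_weights[token], token))
```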
Documentation
Detailed Search Relevance
The following table shows the detailed search relevance performance of different models on various datasets:
Training Datasets
The training datasets include MS MARCO, eli5_question_answer, squad_pairs, WikiAnswers, yahoo_answers_title_question, gooaq_pairs, stackexchange_duplicate_questions_body_body, wikihow, S2ORC_title_abstract, stackexchange_duplicate_questions_title-body_title-body, yahoo_answers_question_answer, searchQA_top5_snippets, stackexchange_duplicate_questions_title_title, yahoo_answers_title_answer, fever, fiqa, hotpotqa, nfcorpus, scifact.
OpenSearch Integration
The OpenSearch neural sparse feature supports learned sparse retrieval with the Lucene inverted index. You can perform indexing and search through the OpenSearch high-level API; see https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/ for the full setup.
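As an illustration of the search side, here is a sketch using the opensearch-py client. The index name, field name, and model ID are hypothetical placeholders; they come from your own index mapping and deployed model (the linked documentation covers creating the ingest pipeline and index):

```python
from opensearchpy import OpenSearch

# Placeholder connection details; adjust for your cluster.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# A neural_sparse query against a sparse-encoded field. "my-nlp-index",
# "passage_embedding", and the model ID below are hypothetical -- replace
# them with the names configured in your cluster.
response = client.search(
    index="my-nlp-index",
    body={
        "query": {
            "neural_sparse": {
                "passage_embedding": {
                    "query_text": "What's the weather in ny now?",
                    "model_id": "<deployed sparse model or tokenizer id>",
                }
            }
        }
    },
)
print(response["hits"]["hits"])
```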
License
This project is licensed under the Apache v2.0 License.
Copyright
Copyright OpenSearch Contributors. See NOTICE for details.