# opensearch-neural-sparse-encoding-v1
This is a learned sparse retrieval model that encodes queries and documents into 30522-dimensional sparse vectors, supporting retrieval with the OpenSearch high-level API.
## Quick Start
When selecting a model, consider search relevance, model inference cost, and retrieval efficiency (FLOPS). We benchmarked the models' zero-shot performance on a subset of the BEIR benchmark: TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.
Overall, the v2 series of models delivers better search relevance, efficiency, and inference speed than the v1 series, though the specific advantages and disadvantages vary across datasets.
| Model | Inference-free for Retrieval | Model Parameters | AVG NDCG@10 | AVG FLOPS |
|---|---|---|---|---|
| [opensearch-neural-sparse-encoding-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1) | | 133M | 0.524 | 11.4 |
| [opensearch-neural-sparse-encoding-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v2-distill) | | 67M | 0.528 | 8.3 |
| [opensearch-neural-sparse-encoding-doc-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v1) | ✔️ | 133M | 0.490 | 2.3 |
| [opensearch-neural-sparse-encoding-doc-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill) | ✔️ | 67M | 0.504 | 1.8 |
| [opensearch-neural-sparse-encoding-doc-v2-mini](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini) | ✔️ | 23M | 0.497 | 1.7 |
## Features
- This is a learned sparse retrieval model. It encodes queries and documents into 30522-dimensional sparse vectors. Each non-zero dimension index corresponds to a token in the vocabulary, and its weight represents the importance of that token.
- The model is trained on the MS MARCO dataset.
- The OpenSearch neural sparse feature supports learned sparse retrieval with the Lucene inverted index (see https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/). Indexing and search can be performed using the OpenSearch high-level API.
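Because the vectors are sparse, relevance scoring reduces to a dot product over the few dimensions that are non-zero in both the query and the document. A minimal sketch of this idea with made-up token weights (the tokens and numbers below are illustrative, not actual model output):

```python
# Illustrative only: sparse vectors represented as {token: weight} dicts.
# These weights are made up; real weights come from the encoder model.
query_vec = {"weather": 1.2, "ny": 0.9, "new": 0.4, "york": 0.4}
doc_vec = {"rainy": 1.1, "new": 0.8, "york": 0.9, "weather": 0.7}

def sparse_dot(a: dict, b: dict) -> float:
    # Only tokens present in BOTH vectors contribute to the score,
    # which is what the inverted index exploits at retrieval time.
    return sum(w * b[t] for t, w in a.items() if t in b)

score = sparse_dot(query_vec, doc_vec)
print(round(score, 2))  # 1.2*0.7 + 0.4*0.8 + 0.4*0.9 = 1.52
```

Tokens that appear in only one side (here "ny" and "rainy") contribute nothing, which is why keeping the vectors sparse keeps retrieval cheap.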
## Usage Examples
### Basic Usage
```python
import itertools
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


def get_sparse_vector(feature, output):
    # Max-pool the MLM logits over the sequence (masking out padding),
    # then apply log(1 + ReLU(x)) saturation and zero out special tokens.
    values, _ = torch.max(output * feature["attention_mask"].unsqueeze(-1), dim=1)
    values = torch.log(1 + torch.relu(values))
    values[:, special_token_ids] = 0  # special_token_ids is defined below, before first call
    return values


def transform_sparse_vector_to_dict(sparse_vector):
    # Convert a batch of sparse vectors into a list of {token: weight} dicts.
    sample_indices, token_indices = torch.nonzero(sparse_vector, as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices, token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0] + number_of_tokens_for_each_sample))
    for i in range(len(end_idxs) - 1):
        token_strings = tokens[end_idxs[i]:end_idxs[i + 1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i + 1]]
        output.append(dict(zip(token_strings, weights)))
    return output


# Load the model and tokenizer.
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v1")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v1")

# IDs of special tokens ([CLS], [SEP], ...) to exclude from the sparse vector.
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids

# Lookup table mapping vocabulary IDs back to token strings.
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token

query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# Encode the query and document in one batch.
feature = tokenizer([query, document], padding=True, truncation=True, return_tensors="pt", return_token_type_ids=False)
output = model(**feature)[0]
sparse_vector = get_sparse_vector(feature, output)

# The relevance score is the dot product of the two sparse vectors.
sim_score = torch.matmul(sparse_vector[0], sparse_vector[1])
print(sim_score)

# Inspect which tokens contribute to the match.
query_token_weight, document_query_token_weight = transform_sparse_vector_to_dict(sparse_vector)
for token in sorted(query_token_weight, key=lambda x: query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"
              % (query_token_weight[token], document_query_token_weight[token], token))
```
The code sample above demonstrates neural sparse search. Even though the original query and document share no overlapping tokens, the model still produces a good match.
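For retrieval inside OpenSearch rather than locally, a `neural_sparse` query can be issued against a field holding the sparse embeddings. A hedged sketch of the request body is below; the index name (`my-nlp-index`), field name (`passage_embedding`), and model ID are placeholders you would replace with your own deployment's values:

```python
# Sketch of an OpenSearch neural_sparse query body. All names here are
# placeholders: "passage_embedding" and the model_id are illustrative.
model_id = "<model id returned when registering the model in OpenSearch>"

query_body = {
    "query": {
        "neural_sparse": {
            "passage_embedding": {
                "query_text": "What's the weather in ny now?",
                "model_id": model_id,
            }
        }
    }
}

# With an OpenSearch client this body would be sent as, e.g.:
# client.search(index="my-nlp-index", body=query_body)
```

OpenSearch expands `query_text` into a sparse vector with the referenced model and scores documents against the inverted index, so no client-side encoding is needed.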
## Documentation
### Detailed Search Relevance
| Model | Average | Trec Covid | NFCorpus | NQ | HotpotQA | FiQA | ArguAna | Touche | DBPedia | SCIDOCS | FEVER | Climate FEVER | SciFact | Quora |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [opensearch-neural-sparse-encoding-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1) | 0.524 | 0.771 | 0.360 | 0.553 | 0.697 | 0.376 | 0.508 | 0.278 | 0.447 | 0.164 | 0.821 | 0.263 | 0.723 | 0.856 |
| [opensearch-neural-sparse-encoding-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v2-distill) | 0.528 | 0.775 | 0.347 | 0.561 | 0.685 | 0.374 | 0.551 | 0.278 | 0.435 | 0.173 | 0.849 | 0.249 | 0.722 | 0.863 |
| [opensearch-neural-sparse-encoding-doc-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v1) | 0.490 | 0.707 | 0.352 | 0.521 | 0.677 | 0.344 | 0.461 | 0.294 | 0.412 | 0.154 | 0.743 | 0.202 | 0.716 | 0.788 |
| [opensearch-neural-sparse-encoding-doc-v2-distill](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill) | 0.504 | 0.690 | 0.343 | 0.528 | 0.675 | 0.357 | 0.496 | 0.287 | 0.418 | 0.166 | 0.818 | 0.224 | 0.715 | 0.841 |
| [opensearch-neural-sparse-encoding-doc-v2-mini](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini) | 0.497 | 0.709 | 0.336 | 0.510 | 0.666 | 0.338 | 0.480 | 0.285 | 0.407 | 0.164 | 0.812 | 0.216 | 0.699 | 0.837 |
## License
This project is licensed under the [Apache v2.0 License](https://github.com/opensearch-project/neural-search/blob/main/LICENSE).
## Copyright
Copyright OpenSearch Contributors. See [NOTICE](https://github.com/opensearch-project/neural-search/blob/main/NOTICE) for details.