OpenSearch Neural Sparse Encoding Model v1 (Open Source): Efficient Search Relevance and Document Retrieval

Opensearch Neural Sparse Encoding V1

Developed by opensearch-project

OpenSearch neural sparse encoding model v1. It encodes queries and documents into 30522-dimensional sparse vectors for relevant and efficient retrieval.

Text Embedding

Transformers

English | Open-source license: Apache-2.0 | #sparse-vector-retrieval #zero-shot-retrieval #efficient-semantic-matching

Downloads: 10.20k

Released: 3/7/2024

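Model Overview

This is a learned sparse retrieval model that encodes queries and documents into 30522-dimensional sparse vectors, targeting both search relevance and retrieval efficiency. The model was trained on the MS MARCO dataset and works with learned sparse retrieval backed by a Lucene inverted index.

Model Features

Efficient sparse encoding

Queries and documents are encoded into 30522-dimensional sparse vectors. Each non-zero dimension index corresponds to a token in the vocabulary, and the weight indicates that token's importance.

Strong relevance performance

Achieves an average NDCG@10 of 0.524 across the BEIR benchmark datasets it was evaluated on.

OpenSearch integration

Designed for OpenSearch clusters, with efficient retrieval through a Lucene inverted index.

Zero-shot performance

Performs well on unseen datasets and can be used without fine-tuning.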
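For readers unfamiliar with the NDCG@10 figures quoted in the feature list, the following is a minimal sketch of the metric using the linear-gain form of DCG. It is an illustration only, not the BEIR evaluation code: the function names are hypothetical, and it assumes the input list holds the graded relevance of every judged document in ranked order.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # enumerate() is 0-based, so the result at 1-based rank r is discounted by log2(r + 1).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the actual ranking divided by the DCG of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents were judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# A ranking that places the single relevant document second instead of first.
print(ndcg_at_k([0, 1, 0]))
```

A benchmark score such as the 0.524 average is the mean of this per-query value over all queries in a dataset, then averaged over datasets.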

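Model Capabilities

Text sparse encoding

Information retrieval

Query-document matching

Zero-shot transfer learning

Use Cases

Search engines

Document retrieval

Efficiently retrieves relevant documents from large document collections.

Average NDCG@10 of 0.524 on the BEIR benchmark.

Question answering systems

Matches user questions against candidate answers.

NDCG@10 of 0.553 on the NQ dataset.

Domain-specific search

Scientific literature search

Retrieves relevant papers from scientific literature databases.

NDCG@10 of 0.723 on the SciFact dataset.

Medical information retrieval

Retrieves medical documents and information.

NDCG@10 of 0.771 on the TrecCovid dataset.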

🚀 opensearch-neural-sparse-encoding-v1

This is a learned sparse retrieval model. It encodes queries and documents into 30522-dimensional sparse vectors, which can be indexed and searched through the OpenSearch high-level API.
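The usage example later in this document derives each vector from the masked-language-model logits. As a compact restatement of that code (the symbols $h$, $m$, $w$, and $n$ are notation introduced here for illustration, not taken from the model card):

```latex
w_j = \log\Bigl(1 + \mathrm{ReLU}\bigl(\max_{1 \le i \le n} m_i \, h_{ij}\bigr)\Bigr),
\qquad j = 1, \dots, 30522
```

Here $h_{ij}$ is the logit for vocabulary entry $j$ at input position $i$, $m_i \in \{0, 1\}$ is the attention mask for that position, and $n$ is the padded sequence length. After this step the example sets $w_j = 0$ for every special token.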

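🚀 Quick Start

Model selection

Choose a model by weighing search relevance, model inference cost, and retrieval efficiency (FLOPS). Zero-shot performance is benchmarked on a subset of the BEIR benchmark: TrecCovid, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Touche, DBPedia, SCIDOCS, FEVER, Climate FEVER, SciFact, and Quora.

Overall, the v2 series models offer better search relevance, efficiency, and inference speed than the v1 series. The specific advantages and disadvantages may vary across datasets.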

Model	Inference-free for retrieval	Model parameters	AVG NDCG@10	AVG FLOPS
opensearch-neural-sparse-encoding-v1		133M	0.524	11.4
opensearch-neural-sparse-encoding-v2-distill		67M	0.528	8.3
opensearch-neural-sparse-encoding-doc-v1	✔️	133M	0.490	2.3
opensearch-neural-sparse-encoding-doc-v2-distill	✔️	67M	0.504	1.8
opensearch-neural-sparse-encoding-doc-v2-mini	✔️	23M	0.497	1.7
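As one way to act on the trade-offs in the table, the sketch below encodes its rows and picks the highest-NDCG model that meets a set of constraints. The `ModelStats` class, the `pick_model` helper, and the idea of using a FLOPS ceiling are assumptions made for illustration; only the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    name: str
    inference_free: bool      # True if no model inference is needed at query time
    params_millions: int
    avg_ndcg_at_10: float
    avg_flops: float


MODELS = [
    ModelStats("opensearch-neural-sparse-encoding-v1", False, 133, 0.524, 11.4),
    ModelStats("opensearch-neural-sparse-encoding-v2-distill", False, 67, 0.528, 8.3),
    ModelStats("opensearch-neural-sparse-encoding-doc-v1", True, 133, 0.490, 2.3),
    ModelStats("opensearch-neural-sparse-encoding-doc-v2-distill", True, 67, 0.504, 1.8),
    ModelStats("opensearch-neural-sparse-encoding-doc-v2-mini", True, 23, 0.497, 1.7),
]


def pick_model(models, require_inference_free=False, max_flops=None):
    """Return the highest-NDCG model satisfying the constraints, or None."""
    candidates = [
        m for m in models
        if (m.inference_free or not require_inference_free)
        and (max_flops is None or m.avg_flops <= max_flops)
    ]
    return max(candidates, key=lambda m: m.avg_ndcg_at_10, default=None)


best = pick_model(MODELS, require_inference_free=True)
print(best.name if best else "no model satisfies the constraints")
```

Because per-dataset results vary, a selection based on averages like this is only a starting point; the detailed relevance table is a better guide for a specific domain.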

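📚 Documentation

Overview

Paper: Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers
Fine-tuning sample: opensearch-sparse-model-tuning-sample

This is a learned sparse retrieval model. It encodes queries and documents into 30522-dimensional sparse vectors. Each non-zero dimension index corresponds to a token in the vocabulary, and the weight indicates that token's importance.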
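To make the index-to-token relationship concrete, here is a toy sketch with a six-entry vocabulary. The vocabulary, the weights, and the helper name `to_token_weights` are all invented for illustration; the real model uses a 30522-entry vocabulary and learned weights.

```python
# Toy vocabulary: hypothetical and far smaller than the real 30522-entry one.
vocab = ["[PAD]", "weather", "new", "york", "rain", "sunny"]

# One weight per vocabulary entry; most entries are zero, which is what
# makes the vector sparse.
dense_weights = [0.0, 2.5, 0.0, 2.0, 1.2, 0.0]


def to_token_weights(weights, vocab):
    """Keep only non-zero dimensions and label each with its vocabulary token."""
    return {vocab[i]: w for i, w in enumerate(weights) if w != 0.0}


sparse = to_token_weights(dense_weights, vocab)
print(sparse)
```

Storing only the non-zero entries as token-weight pairs is what allows such vectors to be held in an inverted index keyed by token.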

This model was trained on the MS MARCO dataset.

OpenSearch's neural sparse feature supports learned sparse retrieval with a Lucene inverted index. Link: https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/ . Indexing and search can be performed with the OpenSearch high-level API.
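As a rough sketch of what a search request can look like, the snippet below builds a request body that uses the `neural_sparse` query clause described at the link above. The helper name, the field name `passage_embedding`, and the model ID are placeholders chosen for illustration; the field must match a sparse field in your own index mapping, the ID is whatever your cluster assigned when the model was deployed, and the linked documentation remains the authority on supported parameters.

```python
import json


def build_neural_sparse_query(field, query_text, model_id, size=10):
    """Build a search request body that uses the neural_sparse query clause."""
    return {
        "size": size,
        "query": {
            "neural_sparse": {
                field: {
                    "query_text": query_text,
                    "model_id": model_id,
                }
            }
        },
    }


body = build_neural_sparse_query(
    field="passage_embedding",            # hypothetical field name
    query_text="What's the weather in ny now?",
    model_id="<your-deployed-model-id>",  # placeholder, not a real ID
)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the JSON body of a search request against the index that holds the encoded documents.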

💻 Usage Examples

Basic usage

import itertools
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


# get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
def get_sparse_vector(feature, output):
    values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
    values = torch.log(1 + torch.relu(values))
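    # zero out special tokens; the id list is attached to this function after the tokenizer loads
    values[:, get_sparse_vector.special_token_ids] = 0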
    return values
    
# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
    for i in range(len(end_idxs)-1):
        token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
        output.append(dict(zip(token_strings, weights)))
    return output
    

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v1")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v1")

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token



query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query & document
feature = tokenizer([query, document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
output = model(**feature)[0]
sparse_vector = get_sparse_vector(feature, output)

# get similarity score
sim_score = torch.matmul(sparse_vector[0],sparse_vector[1])
print(sim_score)   # tensor(22.3299, grad_fn=<DotBackward0>)


query_token_weight, document_query_token_weight = transform_sparse_vector_to_dict(sparse_vector)
for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
        

        
# result:
# score in query: 2.9262, score in document: 2.1335, token: ny
# score in query: 2.5206, score in document: 1.5277, token: weather
# score in query: 2.0373, score in document: 2.3489, token: york
# score in query: 1.5786, score in document: 0.8752, token: cool
# score in query: 1.4636, score in document: 1.5132, token: current
# score in query: 0.7761, score in document: 0.8860, token: season
# score in query: 0.7560, score in document: 0.6726, token: 2020
# score in query: 0.7222, score in document: 0.6292, token: summer
# score in query: 0.6888, score in document: 0.6419, token: nina
# score in query: 0.6451, score in document: 0.8200, token: storm
# score in query: 0.4698, score in document: 0.7635, token: brooklyn
# score in query: 0.4562, score in document: 0.1208, token: julian
# score in query: 0.3484, score in document: 0.3903, token: wow
# score in query: 0.3439, score in document: 0.4160, token: usa
# score in query: 0.2751, score in document: 0.8260, token: manhattan
# score in query: 0.2013, score in document: 0.7735, token: fog
# score in query: 0.1989, score in document: 0.2961, token: mood
# score in query: 0.1653, score in document: 0.3437, token: climate
# score in query: 0.1191, score in document: 0.1533, token: nature
# score in query: 0.0665, score in document: 0.0600, token: temperature
# score in query: 0.0552, score in document: 0.3396, token: windy

The code sample above shows an example of neural sparse search. Although the original query and document share no overlapping tokens, the model still produces a good match.
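Because a dot product only accumulates dimensions that are non-zero in both vectors, the similarity score can be recomputed from the shared tokens alone. The sketch below copies the weights from the printed result and sums their products. Since those weights are rounded to four decimals, the total should only approximately match the `tensor(22.3299)` printed by the sample. The name `shared_token_weights` is introduced here for illustration.

```python
# (query weight, document weight) pairs copied from the printed result.
shared_token_weights = {
    "ny": (2.9262, 2.1335),
    "weather": (2.5206, 1.5277),
    "york": (2.0373, 2.3489),
    "cool": (1.5786, 0.8752),
    "current": (1.4636, 1.5132),
    "season": (0.7761, 0.8860),
    "2020": (0.7560, 0.6726),
    "summer": (0.7222, 0.6292),
    "nina": (0.6888, 0.6419),
    "storm": (0.6451, 0.8200),
    "brooklyn": (0.4698, 0.7635),
    "julian": (0.4562, 0.1208),
    "wow": (0.3484, 0.3903),
    "usa": (0.3439, 0.4160),
    "manhattan": (0.2751, 0.8260),
    "fog": (0.2013, 0.7735),
    "mood": (0.1989, 0.2961),
    "climate": (0.1653, 0.3437),
    "nature": (0.1191, 0.1533),
    "temperature": (0.0665, 0.0600),
    "windy": (0.0552, 0.3396),
}

# Sum of per-token products: the sparse dot product restricted to shared tokens.
score = sum(q * d for q, d in shared_token_weights.values())
print(f"{score:.4f}")
```

The three largest contributions come from "ny", "york", and "weather", none of which appears verbatim in both the query and the document. This is the token expansion effect the example is meant to show.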

Detailed Search Relevance

Model	Average	Trec Covid	NFCorpus	NQ	HotpotQA	FiQA	ArguAna	Touche	DBPedia	SCIDOCS	FEVER	Climate FEVER	SciFact	Quora
opensearch-neural-sparse-encoding-v1	0.524	0.771	0.360	0.553	0.697	0.376	0.508	0.278	0.447	0.164	0.821	0.263	0.723	0.856
opensearch-neural-sparse-encoding-v2-distill	0.528	0.775	0.347	0.561	0.685	0.374	0.551	0.278	0.435	0.173	0.849	0.249	0.722	0.863
opensearch-neural-sparse-encoding-doc-v1	0.490	0.707	0.352	0.521	0.677	0.344	0.461	0.294	0.412	0.154	0.743	0.202	0.716	0.788
opensearch-neural-sparse-encoding-doc-v2-distill	0.504	0.690	0.343	0.528	0.675	0.357	0.496	0.287	0.418	0.166	0.818	0.224	0.715	0.841
opensearch-neural-sparse-encoding-doc-v2-mini	0.497	0.709	0.336	0.510	0.666	0.338	0.480	0.285	0.407	0.164	0.812	0.216	0.699	0.837
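The Average column appears to be the unweighted mean of the 13 per-dataset NDCG@10 scores. The short check below recomputes it from the rows above; the dictionary layout and variable names are assumptions made for this sketch, while the numbers are copied from the table.

```python
# model name -> (reported average, per-dataset NDCG@10 in table column order)
SCORES = {
    "opensearch-neural-sparse-encoding-v1": (
        0.524,
        [0.771, 0.360, 0.553, 0.697, 0.376, 0.508, 0.278,
         0.447, 0.164, 0.821, 0.263, 0.723, 0.856],
    ),
    "opensearch-neural-sparse-encoding-v2-distill": (
        0.528,
        [0.775, 0.347, 0.561, 0.685, 0.374, 0.551, 0.278,
         0.435, 0.173, 0.849, 0.249, 0.722, 0.863],
    ),
    "opensearch-neural-sparse-encoding-doc-v1": (
        0.490,
        [0.707, 0.352, 0.521, 0.677, 0.344, 0.461, 0.294,
         0.412, 0.154, 0.743, 0.202, 0.716, 0.788],
    ),
    "opensearch-neural-sparse-encoding-doc-v2-distill": (
        0.504,
        [0.690, 0.343, 0.528, 0.675, 0.357, 0.496, 0.287,
         0.418, 0.166, 0.818, 0.224, 0.715, 0.841],
    ),
    "opensearch-neural-sparse-encoding-doc-v2-mini": (
        0.497,
        [0.709, 0.336, 0.510, 0.666, 0.338, 0.480, 0.285,
         0.407, 0.164, 0.812, 0.216, 0.699, 0.837],
    ),
}

recomputed = {}
for name, (reported, per_dataset) in SCORES.items():
    # Unweighted mean over datasets, rounded to the table's three decimals.
    recomputed[name] = round(sum(per_dataset) / len(per_dataset), 3)
    print(f"{name}: reported={reported:.3f} recomputed={recomputed[name]:.3f}")
```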