OpenSearch Neural Sparse Encoding Multilingual v1: Open-Source Model

OpenSearch Neural Sparse Encoding Multilingual V1

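Developed by opensearch-project

A learned sparse retrieval model that supports 15 languages. It is designed specifically for OpenSearch and enables efficient, inference-free retrieval.

Text Embedding

Transformers

Multilingual; open-source license: Apache-2.0; tags: #multilingual-sparse-retrieval #inference-free-retrieval #high-dimensional-sparse-vectors

Downloads: 121

Release date: 2/21/2025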

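Model Overview

The model encodes documents into 105879-dimensional sparse vectors and retrieves them efficiently by token weight. It works with the neural sparse feature of OpenSearch.

Model Features

Inference-free retrieval

At search time, the query sparse vector is generated with only a tokenizer and a weight lookup table, so no full model inference is required.

Multilingual support

Supports cross-lingual document retrieval across 15 languages.

Efficient sparse encoding

Encodes documents into 105879-dimensional sparse vectors to keep retrieval efficient.

OpenSearch integration

Designed specifically for OpenSearch, with learned sparse retrieval backed by Lucene inverted indexes.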

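Model Capabilities

Multilingual document retrieval

Sparse vector generation

Efficient similarity computation

Cross-lingual retrieval

Use Cases

Information retrieval

Multilingual document retrieval

Efficient retrieval over multilingual document collections.

Achieves an average NDCG@10 of 0.629 on the MIRACL benchmark.

Enterprise search

Suitable for multilingual document search systems inside an organization.

A substantial improvement over traditional BM25 (average NDCG@10 of 0.629 versus 0.305 on MIRACL).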

🚀 opensearch-neural-sparse-encoding-multilingual-v1

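Choose a model by weighing search relevance, model inference cost, and retrieval efficiency (FLOPS). We evaluate model performance on the MIRACL benchmark (Thai, th, is excluded because the uncased backbone cannot encode it). We recommend using max-ratio pruning.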

Model	Inference-free retrieval	Model parameters	AVG NDCG@10	AVG FLOPS	AVG EMB SIZE
opensearch-neural-sparse-encoding-multilingual-v1	✔️	160M	0.629	1.3	138
opensearch-neural-sparse-encoding-multilingual-v1; prune_ratio 0.1	✔️	160M	0.626	0.8	75
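
As a rough illustration of what the prune_ratio row in the table means, the sketch below applies max-ratio pruning to one sparse vector held as a token-to-weight dict. The function name `prune_by_max_ratio` and the sample weights are hypothetical and chosen for illustration; the rule itself (drop every weight that does not exceed prune_ratio times the largest weight) mirrors the last two lines of `get_sparse_vector` in the usage example.

```python
def prune_by_max_ratio(token_weights, prune_ratio=0.1):
    """Keep only entries whose weight exceeds prune_ratio * the largest weight."""
    if not token_weights:
        return {}
    threshold = max(token_weights.values()) * prune_ratio
    return {tok: w for tok, w in token_weights.items() if w > threshold}


# Hypothetical document vector, not produced by the real model
doc_vector = {"weather": 1.28, "york": 1.10, "rain": 0.95, "the": 0.10, "is": 0.05}
pruned = prune_by_max_ratio(doc_vector, prune_ratio=0.1)
print(sorted(pruned))
```

Fewer surviving entries mean a smaller stored vector, which is consistent with AVG EMB SIZE falling from 138 to 75 in the table while AVG NDCG@10 drops by only 0.003.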

📚 Documentation

Paper: Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers
Fine-tuning sample: opensearch-sparse-model-tuning-sample

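This is a learned sparse retrieval model. It encodes documents into 105879-dimensional sparse vectors. For queries, it generates the sparse vector using only a tokenizer and a weight lookup table. Each non-zero dimension index corresponds to a token in the vocabulary, and its weight expresses that token's importance. The similarity score is the inner product of the query and document sparse vectors.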
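
The scoring scheme described here can be sketched in a few lines of plain Python. This is a minimal illustration, assuming sparse vectors are stored as token-to-weight dicts; the helper names `encode_query` and `sparse_dot`, the lookup table, and the document weights are all hypothetical and do not come from the model.

```python
def encode_query(tokens, idf):
    """Build a query sparse vector using only a token -> weight lookup table."""
    return {tok: idf[tok] for tok in tokens if tok in idf}


def sparse_dot(query_vec, doc_vec):
    """Inner product over the tokens present in both sparse vectors."""
    return sum(w * doc_vec[tok] for tok, w in query_vec.items() if tok in doc_vec)


# Hypothetical lookup table and document vector, for illustration only
idf = {"weather": 3.0, "now": 1.5, "ny": 1.25}
doc_vec = {"weather": 1.0, "ny": 2.0, "rain": 0.5}

query_vec = encode_query(["weather", "in", "ny", "now"], idf)
score = sparse_dot(query_vec, doc_vec)
print(query_vec, score)
```

Only tokens that appear in both vectors contribute to the score, which is why the computation maps naturally onto an inverted index.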

The OpenSearch neural sparse feature supports learned sparse retrieval with Lucene inverted indexes. Link: https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/ . Indexing and search can be performed through the OpenSearch high-level API.
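
To give a feel for what such requests might look like, the sketch below builds an index mapping and a `neural_sparse` search body as Python dicts. It assumes the token weights are stored in a `rank_features` field and that the query is passed as raw token weights through `query_tokens`; the index name, field names, and helper function are hypothetical, and the linked documentation remains the authority on the exact request shape for your OpenSearch version. Nothing is sent to a cluster here.

```python
import json

# Hypothetical names, chosen for illustration only
INDEX_NAME = "my-sparse-index"
FIELD_NAME = "passage_embedding"

# Index mapping: token weights are assumed to live in a rank_features field
index_body = {
    "mappings": {
        "properties": {
            FIELD_NAME: {"type": "rank_features"},
            "passage_text": {"type": "text"},
        }
    }
}


def build_neural_sparse_query(query_tokens, size=10):
    """Build a search body that passes raw token weights to neural_sparse."""
    return {
        "size": size,
        "query": {"neural_sparse": {FIELD_NAME: {"query_tokens": query_tokens}}},
    }


# Query weights taken from the printed result of the usage example
search_body = build_neural_sparse_query({"weather": 3.0699, "now": 1.6406, "ny": 1.2721})
print(json.dumps(search_body, indent=2))
```

In this sketch the query weights are computed on the client, so the cluster would not need to run the model at search time.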

💻 Usage Examples

Basic usage

import json
import itertools
import torch

from transformers import AutoModelForMaskedLM, AutoTokenizer


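# convert masked-LM logits (batch_size * seq_len * vocab_size) into one
# sparse vector per input (batch_size * vocab_size)
def get_sparse_vector(feature, output, prune_ratio=0.1):
    # zero out padded positions, then max-pool over the sequence dimension
    values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
    values = torch.log(1 + torch.relu(values))
    # special tokens get zero weight; the ids are attached to the function further down
    values[:,get_sparse_vector.special_token_ids] = 0
    # max-ratio pruning: keep only weights above prune_ratio * the per-sample maximum
    max_values = values.max(dim=-1)[0].unsqueeze(1) * prune_ratio
    return values * (values > max_values)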
    
# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
    for i in range(len(end_idxs)-1):
        token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
        output.append(dict(zip(token_strings, weights)))
    return output
    
# download the idf file from model hub. idf is used to give weights for query tokens
def get_tokenizer_idf(tokenizer):
    from huggingface_hub import hf_hub_download
    local_cached_path = hf_hub_download(repo_id="opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1", filename="idf.json")
    with open(local_cached_path) as f:
        idf = json.load(f)
    idf_vector = [0]*tokenizer.vocab_size
    for token,weight in idf.items():
        _id = tokenizer._convert_token_to_id_with_added_voc(token)
        idf_vector[_id]=weight
    return torch.tensor(idf_vector)

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1")
idf = get_tokenizer_idf(tokenizer)

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token



query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query
feature_query = tokenizer([query], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
input_ids = feature_query["input_ids"]
batch_size = input_ids.shape[0]
query_vector = torch.zeros(batch_size, tokenizer.vocab_size)
query_vector[torch.arange(batch_size).unsqueeze(-1), input_ids] = 1
query_sparse_vector = query_vector*idf

# encode the document
feature_document = tokenizer([document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
output = model(**feature_document)[0]
document_sparse_vector = get_sparse_vector(feature_document, output)


# get similarity score
sim_score = torch.matmul(query_sparse_vector[0],document_sparse_vector[0])
print(sim_score)   # tensor(7.6317, grad_fn=<DotBackward0>)


query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
document_query_token_weight = transform_sparse_vector_to_dict(document_sparse_vector)[0]
for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
        

        
# result:
# score in query: 3.0699, score in document: 1.2821, token: weather
# score in query: 1.6406, score in document: 0.9018, token: now
# score in query: 1.6108, score in document: 0.3141, token: ?
# score in query: 1.2721, score in document: 1.3446, token: ny

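The code sample above shows an example of neural sparse retrieval. Although the original query and document share no overlapping tokens, the model still matches them well.

Detailed Search Relevance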

Model	Average	bn	te	es	fr	id	hi	ru	ar	zh	fa	ja	fi	sw	ko	en
BM25	0.305	0.482	0.383	0.077	0.115	0.297	0.350	0.256	0.395	0.175	0.287	0.312	0.458	0.351	0.371	0.267
opensearch-neural-sparse-encoding-multilingual-v1	0.629	0.670	0.740	0.542	0.558	0.582	0.486	0.658	0.740	0.562	0.514	0.669	0.767	0.768	0.607	0.575
opensearch-neural-sparse-encoding-multilingual-v1; prune_ratio 0.1	0.626	0.667	0.740	0.537	0.555	0.576	0.481	0.655	0.737	0.558	0.511	0.664	0.761	0.766	0.604	0.572
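
For readers unfamiliar with the metric reported in these tables, here is a minimal sketch of NDCG@k. It assumes linear gain and a log2 rank discount, and it normalizes against the ideal ordering of the same result list; the benchmark's official evaluation tooling may differ in such details. The function names and the relevance list are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of its ideal reordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels of the results in ranked order (1 = relevant)
ranked = [1, 0, 1, 0, 0]
print(round(ndcg_at_k(ranked), 4))
```

A score of 1.0 means every relevant document is ranked ahead of every non-relevant one within the top k.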