opensearch-neural-sparse-encoding-multilingual-v1 Open-Source Model

Opensearch Neural Sparse Encoding Multilingual V1

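Developed by opensearch-project

A learned sparse retrieval model supporting 15 languages, designed for OpenSearch. Queries are encoded without model inference, which keeps retrieval efficient.

Text Embedding

Transformers

Multilingual | License: Apache-2.0 | #MultilingualSparseRetrieval #InferenceFreeSearch #HighDimensionalSparseVectors

Downloads: 121

Released: 2/21/2025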

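Model Overview

The model encodes documents into 105879-dimensional sparse vectors and retrieves efficiently by matching token weights. It works with the OpenSearch neural sparse feature.

Model Features

Inference-free retrieval

At search time, the query's sparse vector is generated with only a tokenizer and a weight lookup table; no full model inference is needed.

Multilingual support

Supports document retrieval across 15 languages.

Efficient sparse encoding

Documents are encoded into 105879-dimensional sparse vectors, so only the non-zero token weights need to be stored and matched.

OpenSearch integration

Designed for OpenSearch, which supports learned sparse retrieval on the Lucene inverted index.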

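Model Capabilities

Multilingual document retrieval

Sparse vector generation

Efficient similarity computation

Cross-lingual search

Use Cases

Information retrieval

Multilingual document search

Efficient retrieval over multilingual document collections

Average NDCG@10 of 0.629 on the MIRACL benchmark

Enterprise search

Search systems for multilingual documents inside an enterprise

Average NDCG@10 on MIRACL is 0.629, compared with 0.305 for BM25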

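🚀 OpenSearch Neural Sparse Encoding Model, Multilingual v1

This project is a multilingual learned sparse retrieval model. It encodes documents into high-dimensional sparse vectors and scores similarity by inner product, and it performs well on multilingual retrieval tasks.

🚀 Quick Start

Model Selection

When selecting a model, weigh search relevance, model inference cost, and retrieval efficiency (FLOPS) together. We evaluated the model on the MIRACL benchmark; Thai (th) is excluded because the uncased backbone model cannot encode it. We recommend max-ratio pruning, which drops every token whose weight does not exceed the given ratio of the document's largest weight.

Model	Inference-free retrieval	Model parameters	Avg. NDCG@10	Avg. FLOPS	Avg. embedding size
opensearch-neural-sparse-encoding-multilingual-v1	✔️	160M	0.629	1.3	138
opensearch-neural-sparse-encoding-multilingual-v1; prune ratio 0.1	✔️	160M	0.626	0.8	75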
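
The max-ratio pruning rule can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `prune_max_ratio` and the token weights are made up for the example. It applies the same thresholding as `get_sparse_vector` in the usage example, keeping only weights strictly above `prune_ratio` times the largest weight.

```python
def prune_max_ratio(token_weights, prune_ratio=0.1):
    """Keep only tokens whose weight exceeds prune_ratio * the largest weight.

    Hypothetical helper for illustration; weights are a dict of token -> float.
    """
    if not token_weights:
        return {}
    threshold = max(token_weights.values()) * prune_ratio
    return {tok: w for tok, w in token_weights.items() if w > threshold}


# Made-up document weights, for illustration only.
doc_weights = {"york": 1.5, "rainy": 1.2, "weather": 0.9, "umbrella": 0.14, "the": 0.05}

pruned = prune_max_ratio(doc_weights, prune_ratio=0.1)
print(pruned)  # "umbrella" and "the" fall below 0.1 * 1.5 and are dropped
```

In the table above, a prune ratio of 0.1 reduces the average embedding size from 138 to 75 while average NDCG@10 moves only from 0.629 to 0.626.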

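✨ Key Features

Overview

Paper: Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers
Fine-tuning sample: opensearch-sparse-model-tuning-sample

This is a learned sparse retrieval model. It encodes documents into 105879-dimensional sparse vectors. For queries, it uses only a tokenizer and a weight lookup table to generate sparse vectors. Each non-zero dimension index corresponds to a token in the vocabulary, and its weight expresses the importance of that token. The similarity score is the inner product of the query and document sparse vectors.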
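
As a rough illustration of the inference-free query path, the sketch below builds a query vector by table lookup and scores it against a document vector. Everything in it is a toy assumption: `toy_tokenize` stands in for the model's real tokenizer, `TOY_IDF` stands in for the shipped `idf.json` table, and all weights are invented.

```python
# Toy lookup table (token -> weight). Invented numbers; the real table is idf.json.
TOY_IDF = {"weather": 3.0, "ny": 1.3, "now": 1.6, "in": 0.2}


def toy_tokenize(text):
    """Hypothetical stand-in for the model tokenizer: lowercase + whitespace split."""
    return text.lower().split()


def encode_query(text, idf=TOY_IDF):
    """Inference-free encoding: each distinct token gets its looked-up weight."""
    tokens = set(toy_tokenize(text))
    return {tok: idf[tok] for tok in tokens if idf.get(tok, 0) > 0}


def inner_product(query_vec, doc_vec):
    """Sum of weight products over tokens that appear in both sparse vectors."""
    return sum(w * doc_vec[tok] for tok, w in query_vec.items() if tok in doc_vec)


query_vec = encode_query("weather in ny now")
# Document weights would come from the neural encoder; these values are invented.
doc_vec = {"weather": 1.0, "ny": 2.0, "rainy": 1.5}

score = inner_product(query_vec, doc_vec)
print(round(score, 4))  # 3.0 * 1.0 + 1.3 * 2.0
```

Only tokens with a non-zero weight on both sides contribute, which is why the score can be served from an inverted index.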

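The OpenSearch neural sparse feature supports learned sparse retrieval with the Lucene inverted index (see https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/). Indexing and search can be performed with the OpenSearch high-level API.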
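
The sketch below shows what the request bodies for such a setup might look like, built as plain Python dictionaries with no network calls. The index, pipeline, and field names and the model ID are placeholders that do not come from this model card, and the request shapes should be checked against the neural sparse documentation linked above for your OpenSearch version.

```python
import json

# Placeholder names: none of these come from the model card.
PIPELINE_NAME = "sparse-ingest-pipeline"   # hypothetical ingest pipeline name
TEXT_FIELD = "passage_text"                # hypothetical source text field
SPARSE_FIELD = "passage_embedding"         # hypothetical sparse vector field
MODEL_ID = "<your-registered-model-id>"    # ID assigned when the model is registered

# Ingest pipeline: a sparse_encoding processor fills SPARSE_FIELD from TEXT_FIELD.
ingest_pipeline = {
    "description": "Sparse encoding ingest pipeline (sketch)",
    "processors": [
        {"sparse_encoding": {"model_id": MODEL_ID, "field_map": {TEXT_FIELD: SPARSE_FIELD}}}
    ],
}

# Index body: token weights are stored in a rank_features field.
index_body = {
    "settings": {"default_pipeline": PIPELINE_NAME},
    "mappings": {
        "properties": {
            TEXT_FIELD: {"type": "text"},
            SPARSE_FIELD: {"type": "rank_features"},
        }
    },
}

# Search body: a neural_sparse query against the sparse field.
search_body = {
    "query": {
        "neural_sparse": {
            SPARSE_FIELD: {"query_text": "What's the weather in ny now?", "model_id": MODEL_ID}
        }
    }
}

for name, body in [("pipeline", ingest_pipeline), ("index", index_body), ("search", search_body)]:
    print(name)
    print(json.dumps(body, indent=2))
```

These bodies would be sent with whichever OpenSearch client you use; the client calls themselves are left out because they depend on the client and version.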

Detailed Search Relevance

Model	Average	Bengali	Telugu	Spanish	French	Indonesian	Hindi	Russian	Arabic	Chinese	Persian	Japanese	Finnish	Swahili	Korean	English
BM25	0.305	0.482	0.383	0.077	0.115	0.297	0.350	0.256	0.395	0.175	0.287	0.312	0.458	0.351	0.371	0.267
opensearch-neural-sparse-encoding-multilingual-v1	0.629	0.670	0.740	0.542	0.558	0.582	0.486	0.658	0.740	0.562	0.514	0.669	0.767	0.768	0.607	0.575
opensearch-neural-sparse-encoding-multilingual-v1; prune ratio 0.1	0.626	0.667	0.740	0.537	0.555	0.576	0.481	0.655	0.737	0.558	0.511	0.664	0.761	0.766	0.604	0.572
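
The Average column is the unweighted mean of the 15 per-language NDCG@10 scores. The short check below recomputes it from the rows above; the variable names are illustrative, and the scores are copied from the table in column order (Bengali through English).

```python
# Per-language NDCG@10 scores copied from the table, Bengali through English.
bm25 = [0.482, 0.383, 0.077, 0.115, 0.297, 0.350, 0.256, 0.395,
        0.175, 0.287, 0.312, 0.458, 0.351, 0.371, 0.267]
sparse = [0.670, 0.740, 0.542, 0.558, 0.582, 0.486, 0.658, 0.740,
          0.562, 0.514, 0.669, 0.767, 0.768, 0.607, 0.575]
sparse_pruned = [0.667, 0.740, 0.537, 0.555, 0.576, 0.481, 0.655, 0.737,
                 0.558, 0.511, 0.664, 0.761, 0.766, 0.604, 0.572]


def mean3(scores):
    """Unweighted mean, rounded to three decimals as in the table."""
    return round(sum(scores) / len(scores), 3)


averages = {
    "BM25": mean3(bm25),
    "multilingual-v1": mean3(sparse),
    "multilingual-v1, prune ratio 0.1": mean3(sparse_pruned),
}
for name, avg in averages.items():
    print(f"{name}: {avg}")

# Ratio of the unpruned model's average to the BM25 average.
ratio = round(averages["multilingual-v1"] / averages["BM25"], 2)
print("ratio vs. BM25:", ratio)
```

The recomputed means agree with the reported 0.305, 0.629, and 0.626, and the unpruned model's average is roughly twice that of BM25.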

💻 Usage Examples

Basic Usage

import json
import itertools
import torch

from transformers import AutoModelForMaskedLM, AutoTokenizer


# get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
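def get_sparse_vector(feature, output, prune_ratio=0.1):
    # max-pool the logits over the sequence dimension, with padded positions masked to zero
    values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
    # log-saturate the non-negative activations
    values = torch.log(1 + torch.relu(values))
    # zero out special tokens (ids are attached to this function once the tokenizer is loaded)
    values[:,get_sparse_vector.special_token_ids] = 0
    # max-ratio pruning: keep only weights above prune_ratio * the per-sample maximum
    max_values = values.max(dim=-1)[0].unsqueeze(1) * prune_ratio
    return values * (values > max_values)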
    
# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
    for i in range(len(end_idxs)-1):
        token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
        output.append(dict(zip(token_strings, weights)))
    return output
    
# download the idf file from model hub. idf is used to give weights for query tokens
def get_tokenizer_idf(tokenizer):
    from huggingface_hub import hf_hub_download
    local_cached_path = hf_hub_download(repo_id="opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1", filename="idf.json")
    with open(local_cached_path) as f:
        idf = json.load(f)
    idf_vector = [0]*tokenizer.vocab_size
    for token,weight in idf.items():
        _id = tokenizer.convert_tokens_to_ids(token)  # public API instead of the private helper
        idf_vector[_id]=weight
    return torch.tensor(idf_vector)

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1")
idf = get_tokenizer_idf(tokenizer)

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token



query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query
feature_query = tokenizer([query], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
input_ids = feature_query["input_ids"]
batch_size = input_ids.shape[0]
query_vector = torch.zeros(batch_size, tokenizer.vocab_size)
query_vector[torch.arange(batch_size).unsqueeze(-1), input_ids] = 1
query_sparse_vector = query_vector*idf

# encode the document
feature_document = tokenizer([document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
output = model(**feature_document)[0]
document_sparse_vector = get_sparse_vector(feature_document, output)


# get similarity score
sim_score = torch.matmul(query_sparse_vector[0],document_sparse_vector[0])
print(sim_score)   # tensor(7.6317, grad_fn=<DotBackward0>)


query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
document_query_token_weight = transform_sparse_vector_to_dict(document_sparse_vector)[0]
for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
        

        
# result:
# score in query: 3.0699, score in document: 1.2821, token: weather
# score in query: 1.6406, score in document: 0.9018, token: now
# score in query: 1.6108, score in document: 0.3141, token: ?
# score in query: 1.2721, score in document: 1.3446, token: ny

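The code above shows an example of neural sparse search. Although the original query and document share no tokens, the document's sparse vector carries weights for query terms such as "weather" and "ny", so the two still match well.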
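
As a sanity check, the similarity score can be recomputed by hand from the four overlapping tokens printed in the result. The snippet below is a small worked calculation using only those displayed, rounded weights; the variable names are illustrative.

```python
# (token, weight in query, weight in document), copied from the printed result.
overlap = [
    ("weather", 3.0699, 1.2821),
    ("now", 1.6406, 0.9018),
    ("?", 1.6108, 0.3141),
    ("ny", 1.2721, 1.3446),
]

# The inner product only receives contributions from tokens present on both sides.
score = sum(q_weight * d_weight for _, q_weight, d_weight in overlap)
print(f"{score:.3f}")  # about 7.63, agreeing with tensor(7.6317) up to rounding

# Per-token contributions show which terms drive the match.
for token, q_weight, d_weight in overlap:
    print(f"{token}: {q_weight * d_weight:.4f}")
```

More than half of the score comes from "weather", a term that never appears in the document text itself.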

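📄 License

This project is licensed under the Apache v2.0 License.

📚 Documentation

Copyright

Copyright OpenSearch Contributors. See NOTICE for details.

Supported Languages

Property	Details
Supported languages	English, Chinese, French, Bengali, Telugu, Spanish, Indonesian, Hindi, Russian, Arabic, Persian, Japanese, Finnish, Swahili, Korean
Dataset	miracl/miracl
Tags	learned sparse, OpenSearch, Transformer, retrieval, passage retrieval, document expansion, bag-of-words
License	Apache 2.0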