opensearch-neural-sparse-encoding-multilingual-v1 Open-Source Model


OpenSearch Neural Sparse Encoding Multilingual V1

Developed by opensearch-project

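A learned sparse retrieval model supporting 15 languages, designed for OpenSearch. Queries are encoded without model inference, which keeps retrieval efficient.

Text Embedding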

Transformers

Supports Multiple Languages. Open Source License: Apache-2.0. #multilingual-sparse-retrieval #inference-free-search #high-dimensional-sparse-vectors

Downloads 121

Release Time : 2/21/2025

Model Overview

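The model encodes documents into 105879-dimensional sparse vectors whose non-zero entries are token weights, and it works with the OpenSearch neural sparse feature.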

Model Features

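Inference-Free Retrieval

At search time, the query's sparse vector is generated with only a tokenizer and a weight lookup table; no model forward pass is needed.

Multilingual Support

Supports document retrieval across 15 languages.

Efficient Sparse Encoding

Encodes documents into 105879-dimensional sparse vectors, which keeps retrieval efficient.

OpenSearch Integration

Designed for OpenSearch; learned sparse retrieval runs on the Lucene inverted index.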

Model Capabilities

Multilingual document retrieval

Sparse vector generation

Efficient similarity computation

Cross-lingual search

Use Cases

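Information Retrieval

Multilingual document search

Efficient retrieval over multilingual document collections

Achieves an average NDCG@10 of 0.629 on the MIRACL benchmark

Enterprise Search

Search systems for multilingual internal enterprise documents

A substantial improvement over traditional BM25 (average NDCG@10 of 0.629 versus 0.305 on MIRACL)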

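🚀 OpenSearch Neural Sparse Encoding Model, Multilingual v1

This project is a multilingual learned sparse retrieval model. It encodes documents into high-dimensional sparse vectors and scores similarity by inner product, and it performs well on multilingual retrieval tasks.

🚀 Quick Start

Model Selection

Choose a model by weighing search relevance, model inference cost, and retrieval efficiency (FLOPS). We evaluated the models on the MIRACL benchmark (the th language is excluded because the uncased backbone model cannot encode it). We recommend pruning by max ratio.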

Model	Inference-Free Retrieval	Model Parameters	Avg NDCG@10	Avg FLOPS	Avg Embedding Size
opensearch-neural-sparse-encoding-multilingual-v1	✔️	160M	0.629	1.3	138
opensearch-neural-sparse-encoding-multilingual-v1; prune ratio 0.1	✔️	160M	0.626	0.8	75
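
The "prune ratio 0.1" row appears to correspond to max-ratio pruning: a token weight is kept only if it exceeds 0.1 times the largest weight in the same vector. Below is a minimal sketch on a plain dict. The function name `prune_by_max_ratio` and the sample weights are made up for illustration and are not model output.

```python
def prune_by_max_ratio(token_weights, prune_ratio=0.1):
    """Keep tokens whose weight is strictly above prune_ratio * max weight."""
    if not token_weights:
        return {}
    threshold = max(token_weights.values()) * prune_ratio
    return {tok: w for tok, w in token_weights.items() if w > threshold}


# Illustrative weights only (hypothetical, not produced by the model).
doc_vector = {"weather": 1.28, "rainy": 1.10, "york": 0.95, "the": 0.12, "of": 0.05}

pruned = prune_by_max_ratio(doc_vector, prune_ratio=0.1)
print(sorted(pruned))  # low-weight tokens "the" and "of" fall below the threshold
```

Because the threshold scales with each vector's own maximum, short and long documents are pruned proportionally rather than by a fixed cutoff.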

✨ Key Features

Model Overview

Paper: Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers
Fine-tuning sample: opensearch-sparse-model-tuning-sample

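This is a learned sparse retrieval model. It encodes documents into 105879-dimensional sparse vectors. For queries, it uses only a tokenizer and a weight lookup table to generate sparse vectors. Each non-zero dimension index corresponds to a token in the vocabulary, and its weight indicates that token's importance. The similarity score is the inner product of the query and document sparse vectors.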
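
To make the query side concrete, here is a minimal sketch of lookup-only query encoding followed by an inner product. It assumes a whitespace tokenizer and a tiny hand-written weight table (`toy_tokenize`, `toy_weights`, and `toy_doc` are hypothetical stand-ins); the real model uses its own tokenizer and the weights shipped with it.

```python
def encode_query(query, tokenize, token_weights):
    """Build a sparse query vector {token: weight} using table lookup only."""
    vector = {}
    for token in tokenize(query):
        weight = token_weights.get(token, 0.0)
        if weight > 0.0:
            vector[token] = weight  # a repeated token still counts once
    return vector


def inner_product(query_vec, doc_vec):
    """Sum weight products over tokens present in both sparse vectors."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)


def toy_tokenize(text):
    # Hypothetical stand-in for the model's tokenizer.
    return text.lower().split()


# Hypothetical weights, chosen only to keep the arithmetic easy to follow.
toy_weights = {"weather": 3.0, "now": 1.5, "ny": 1.25}
toy_doc = {"weather": 1.0, "ny": 2.0, "rainy": 1.5}

query_vec = encode_query("weather in ny now", toy_tokenize, toy_weights)
score = inner_product(query_vec, toy_doc)
print(query_vec, score)
```

Only "weather" and "ny" occur in both vectors, so the score is 3.0 * 1.0 + 1.25 * 2.0 = 5.5. No neural network runs on the query path, which is what makes the query side inference-free.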

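The OpenSearch neural sparse feature supports learned sparse retrieval with the Lucene inverted index. Link: https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/. Indexing and search can be performed with the OpenSearch high-level API.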
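
As a rough illustration of how precomputed token weights might be stored and queried, the sketch below only builds request bodies as Python dicts and sends nothing to a cluster. It assumes a `rank_features` field for the sparse vector and a `neural_sparse` query that takes raw `query_tokens`; check the linked documentation for the options your OpenSearch version supports. The index name, field names, and sample weights are hypothetical.

```python
import json

INDEX_NAME = "my-sparse-index"       # hypothetical index name
VECTOR_FIELD = "passage_embedding"   # hypothetical sparse-vector field name


def build_mapping():
    """Index mapping: raw text plus a token -> weight sparse vector field."""
    return {
        "mappings": {
            "properties": {
                "passage_text": {"type": "text"},
                VECTOR_FIELD: {"type": "rank_features"},
            }
        }
    }


def build_document(text, token_weights):
    """Document body carrying weights precomputed by the document encoder."""
    return {"passage_text": text, VECTOR_FIELD: token_weights}


def build_query(query_token_weights, size=10):
    """Search body that passes the query's token weights directly."""
    return {
        "size": size,
        "query": {
            "neural_sparse": {
                VECTOR_FIELD: {"query_tokens": query_token_weights}
            }
        },
    }


# Sample weights are illustrative, not real model output.
doc_body = build_document("Currently New York is rainy.", {"weather": 1.28, "ny": 1.34})
query_body = build_query({"weather": 3.07, "ny": 1.27})
print(json.dumps(query_body, indent=2))
```

In a typical deployment the encoding would instead run inside OpenSearch through an ingest pipeline and a registered model, so these hand-built bodies are mainly useful for experiments with vectors computed offline.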

Detailed Search Relevance

Model	Average	Bengali	Telugu	Spanish	French	Indonesian	Hindi	Russian	Arabic	Chinese	Persian	Japanese	Finnish	Swahili	Korean	English
BM25	0.305	0.482	0.383	0.077	0.115	0.297	0.350	0.256	0.395	0.175	0.287	0.312	0.458	0.351	0.371	0.267
opensearch-neural-sparse-encoding-multilingual-v1	0.629	0.670	0.740	0.542	0.558	0.582	0.486	0.658	0.740	0.562	0.514	0.669	0.767	0.768	0.607	0.575
opensearch-neural-sparse-encoding-multilingual-v1; prune ratio 0.1	0.626	0.667	0.740	0.537	0.555	0.576	0.481	0.655	0.737	0.558	0.511	0.664	0.761	0.766	0.604	0.572

💻 Usage Example

Basic Usage

import json
import itertools
import torch

from transformers import AutoModelForMaskedLM, AutoTokenizer


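# turn MLM logits of shape (batch_size, seq_len, vocab_size) into one sparse
# vector per input, with shape (batch_size, vocab_size)
def get_sparse_vector(feature, output, prune_ratio=0.1):
    # max-pool over the sequence dimension; padded positions are zeroed by the mask
    values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
    # log(1 + relu(x)) keeps weights non-negative and dampens large logits
    values = torch.log(1 + torch.relu(values))
    # zero out special tokens (special_token_ids is a global set after the tokenizer loads)
    values[:,special_token_ids] = 0
    # max-ratio pruning: keep only weights above prune_ratio * the per-row maximum
    max_values = values.max(dim=-1)[0].unsqueeze(1) * prune_ratio
    return values * (values > max_values)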
    
# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
    for i in range(len(end_idxs)-1):
        token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
        output.append(dict(zip(token_strings, weights)))
    return output
    
# download the idf file from model hub. idf is used to give weights for query tokens
def get_tokenizer_idf(tokenizer):
    from huggingface_hub import hf_hub_download
    local_cached_path = hf_hub_download(repo_id="opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1", filename="idf.json")
    with open(local_cached_path) as f:
        idf = json.load(f)
    idf_vector = [0]*tokenizer.vocab_size
    for token,weight in idf.items():
        # public equivalent of the private _convert_token_to_id_with_added_voc
        _id = tokenizer.convert_tokens_to_ids(token)
        idf_vector[_id]=weight
    return torch.tensor(idf_vector)

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1")
idf = get_tokenizer_idf(tokenizer)

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token



query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query
feature_query = tokenizer([query], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
input_ids = feature_query["input_ids"]
batch_size = input_ids.shape[0]
query_vector = torch.zeros(batch_size, tokenizer.vocab_size)
query_vector[torch.arange(batch_size).unsqueeze(-1), input_ids] = 1
query_sparse_vector = query_vector*idf

# encode the document
feature_document = tokenizer([document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
output = model(**feature_document)[0]
document_sparse_vector = get_sparse_vector(feature_document, output)


# get similarity score
sim_score = torch.matmul(query_sparse_vector[0],document_sparse_vector[0])
print(sim_score)   # tensor(7.6317, grad_fn=<DotBackward0>)


query_token_weight = transform_sparse_vector_to_dict(query_sparse_vector)[0]
document_query_token_weight = transform_sparse_vector_to_dict(document_sparse_vector)[0]
for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
        

        
# result:
# score in query: 3.0699, score in document: 1.2821, token: weather
# score in query: 1.6406, score in document: 0.9018, token: now
# score in query: 1.6108, score in document: 0.3141, token: ?
# score in query: 1.2721, score in document: 1.3446, token: ny

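The code above shows an example of neural sparse search. Although the original query and document share no overlapping tokens, the model still matches them well.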
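
The printed similarity can be reproduced from the four matched tokens alone, since a token missing from either vector contributes nothing to an inner product. The sketch below copies the rounded weights from the printed result, so its total agrees with the printed score only up to rounding (about 7.63). The variable names are chosen for illustration.

```python
# (query weight, document weight) pairs copied from the printed result above
matches = {
    "weather": (3.0699, 1.2821),
    "now": (1.6406, 0.9018),
    "?": (1.6108, 0.3141),
    "ny": (1.2721, 1.3446),
}

# Each matched token contributes query_weight * document_weight to the score.
contributions = {tok: q * d for tok, (q, d) in matches.items()}
total = sum(contributions.values())

for tok, value in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok}: {value:.4f} ({value / total:.0%} of the score)")
print(f"total: {total:.2f}")
```

The token "weather" never appears in the document text, yet it contributes the most, which illustrates how document-side expansion bridges the vocabulary gap between "ny" and "New York is rainy".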

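📄 License

This project is licensed under the Apache v2.0 License.

📚 Documentation

Copyright

Copyright OpenSearch Contributors. See NOTICE for details.

Supported Languages

Property	Details
Supported languages	English, Chinese, French, Bengali, Telugu, Spanish, Indonesian, Hindi, Russian, Arabic, Persian, Japanese, Finnish, Swahili, Korean
Datasets	miracl/miracl
Tags	learned sparse, OpenSearch, Transformer, retrieval, passage retrieval, document expansion, bag-of-words
License	Apache 2.0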