fermi-1024: an open-source sparse retrieval model optimized for nuclear energy queries and domain terminology

Fermi 1024

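Developed by atomic-canyon

A sparse retrieval model optimized for the nuclear energy domain. It encodes queries and documents as high-dimensional sparse vectors and handles nuclear-specific terminology more efficiently.

Text embedding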

Transformers

Language: English · License: Apache-2.0 · #nuclear-energy-specific #sparse-retrieval #high-dimensional-sparse-vectors

Downloads: 2,502

Released: 9/4/2024

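Model overview

A sparse retrieval model designed for nuclear energy applications. Its vocabulary and sparse embedding representation are built on a nuclear-specific tokenizer, which improves handling of domain terms such as 'NRC'.

Model features

Nuclear-domain optimization

Uses a nuclear-specific tokenizer that treats domain terms as single tokens, improving accuracy and efficiency.

Efficient sparse representation

Produces high-dimensional sparse vectors whose non-zero dimensions correspond to important vocabulary tokens, significantly reducing compute and storage requirements.

Long-context support

Supports a 1024-token context window, which reduces the number of embeddings required and lowers compute cost.

Model capabilities

Nuclear-domain document retrieval

Domain terminology recognition

Efficient vector encoding

Large-scale document processing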

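Use cases

Nuclear information retrieval

Nuclear regulation retrieval

Quickly retrieve regulatory documents from nuclear regulators such as the NRC.

Reaches 0.72 NDCG@10 on FermiBench

Technical document search

Efficiently search nuclear power plant technical documents for specific content.

Reduces compute cost by 50% compared with general-purpose models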

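🚀 fermi-1024: A Sparse Retrieval Model for Nuclear Energy

This is a sparse retrieval model optimized for nuclear-specific applications. It encodes both queries and documents as high-dimensional sparse vectors, where each non-zero dimension corresponds to a specific token in the vocabulary and its value represents that token's relative importance.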
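
To make the representation concrete, here is a minimal sketch of two sparse vectors stored as token-to-weight dictionaries and scored with a dot product. The tokens and weights are invented for illustration and are not real fermi-1024 output; `sparse_dot` is a hypothetical helper name.

```python
# Toy sparse vectors as {token: weight} dicts. The tokens and weights are
# made up for illustration; they are not actual model output.
def sparse_dot(query_vec, doc_vec):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict; only tokens present in both contribute.
    small, large = sorted((query_vec, doc_vec), key=len)
    return sum(weight * large[token] for token, weight in small.items() if token in large)


query_vec = {"heat": 1.5, "load": 1.0, "assembly": 0.5}
doc_vec = {"heat": 2.0, "load": 1.0, "kw": 0.5, "fuel": 0.25}

score = sparse_dot(query_vec, doc_vec)
print(score)
```

Only the shared tokens ("heat" and "load" here) add to the score, which is why most of the vocabulary dimensions can be skipped at query time.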

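The vocabulary, and therefore the sparse embeddings, are based on a nuclear-specific tokenizer. For example, a term like "NRC" is represented as a single token rather than being split into several. This improves both accuracy and efficiency. To make this possible, we trained a nuclear-specific BERT base model.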
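
The effect of a domain vocabulary can be sketched with a toy greedy longest-match splitter in the style of BERT WordPiece. This assumes the tokenizer behaves like a standard WordPiece tokenizer; both vocabularies below are hypothetical and far smaller than the real 30522-entry vocabulary.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercase word into pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Non-initial pieces carry the "##" continuation prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            # No prefix of the remaining text is in the vocabulary.
            return ["[UNK]"]
        start = end
    return pieces


general_vocab = {"n", "##r", "##c"}      # hypothetical general-purpose vocabulary
nuclear_vocab = general_vocab | {"nrc"}  # hypothetical domain vocabulary

general_pieces = wordpiece("nrc", general_vocab)
nuclear_pieces = wordpiece("nrc", nuclear_vocab)
print(general_pieces)
print(nuclear_pieces)
```

With the domain entry present, the term occupies one vocabulary dimension instead of being spread over several generic sub-word pieces.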

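🚀 Quick start

This model is a sparse retrieval model optimized for nuclear-specific applications. It encodes queries and documents as high-dimensional sparse vectors to improve retrieval accuracy and efficiency.

✨ Key features

Domain-specific optimization: tuned for nuclear energy applications, so it handles nuclear-related queries and documents better.
Efficient encoding: uses a nuclear-specific tokenizer to encode queries and documents as high-dimensional sparse vectors, improving accuracy and efficiency.
Lower compute cost: the 1024-token context length halves the number of embeddings required, which lowers compute cost; the custom tokenizer encodes text with fewer tokens, which improves computational efficiency; and the model produces sparser vectors, which reduces floating-point operations and index storage requirements.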
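
The "half as many embeddings" point is simple arithmetic, sketched below. It assumes the comparison is against a 512-token window, which the text does not state explicitly, and uses a hypothetical document length with non-overlapping windows.

```python
import math


def num_chunks(doc_tokens, window):
    """Number of non-overlapping windows needed to cover a document."""
    return math.ceil(doc_tokens / window)


doc_tokens = 8192  # hypothetical document length in tokens

chunks_512 = num_chunks(doc_tokens, 512)    # assumed baseline window
chunks_1024 = num_chunks(doc_tokens, 1024)  # fermi-1024 context length
print(chunks_512, chunks_1024)
```

Each chunk needs one forward pass and one stored vector, so halving the chunk count halves both, before counting any additional savings from shorter token sequences.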

💻 Usage example

Basic usage

import itertools
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


# get sparse vector from dense vectors with shape batch_size * seq_len * vocab_size
def get_sparse_vector(feature, output):
    values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
    values = torch.log(1 + torch.relu(values))
    values[:,special_token_ids] = 0
    return values
    
# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
    for i in range(len(end_idxs)-1):
        token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
        output.append(dict(zip(token_strings, weights)))
    return output
    

# load the model
model = AutoModelForMaskedLM.from_pretrained("atomic-canyon/fermi-1024")
tokenizer = AutoTokenizer.from_pretrained("atomic-canyon/fermi-1024")

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
id_to_token = [""] * tokenizer.vocab_size
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token

query = "What is the maximum heat load per spent fuel assembly for the EOS-37PTH?"
document = "For the EOS-37PTH DSC, add two new heat load zone configurations (HLZCs) for the EOS37PTH for higher heat load assemblies, up to 3.5 kW/assembly, that also allow for damaged and failed fuel storage."

# encode the query & document
feature = tokenizer([query, document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
with torch.no_grad():  # inference only, so skip gradient tracking
    output = model(**feature)[0]
sparse_vector = get_sparse_vector(feature, output)

# get similarity score
sim_score = torch.matmul(sparse_vector[0],sparse_vector[1])
print(sim_score)


query_token_weight, document_query_token_weight = transform_sparse_vector_to_dict(sparse_vector)
for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
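
The token-to-weight dictionaries produced above lend themselves to an inverted index, which is how sparse vectors are typically served at scale. The following is a minimal in-memory sketch, not the project's own indexing code; `build_index`, `search`, and the toy documents are hypothetical.

```python
from collections import defaultdict


def build_index(doc_vectors):
    """Map each token to a list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for token, weight in vec.items():
            index[token].append((doc_id, weight))
    return index


def search(index, query_vec, top_k=10):
    """Score documents by dot product, reading only the query tokens' postings."""
    scores = defaultdict(float)
    for token, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy document vectors; weights are invented for illustration.
docs = {
    "doc_a": {"heat": 2.0, "load": 1.0},
    "doc_b": {"fuel": 1.0, "load": 0.5},
    "doc_c": {"license": 1.0},
}
index = build_index(docs)
results = search(index, {"heat": 1.0, "load": 2.0})
print(results)
```

Documents that share no token with the query (doc_c here) are never touched, which is where the storage and compute savings of sparser vectors come from.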

📚 Detailed documentation

Specifications

Attribute	Details
Developer	Atomic Canyon
Fine-tuned from	fermi-bert-1024
Context length	1024
Vocabulary size	30522
License	Apache 2.0

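Training

The fermi-1024 model was trained on the MS MARCO Passage dataset using the LSR framework, with ms-marco-MiniLM-L-6-v2 as the teacher model. Training ran on AMD MI250X GPUs on the Frontier supercomputer at Oak Ridge National Laboratory.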

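Evaluation

This sparse embedding model was evaluated primarily on information retrieval quality in the nuclear energy domain. Because no domain-specific benchmark existed, we developed FermiBench to measure performance on nuclear-related text. The model was also tested on the MS MARCO dev set and the BEIR benchmark to confirm broader applicability. It shows strong retrieval performance, particularly on nuclear-specific jargon and documents.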
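
Ranking benchmarks such as BEIR are commonly scored with NDCG@10, the metric also quoted for FermiBench earlier on this page. The sketch below shows one way that metric can be computed, using linear gain. It is a simplified illustration rather than the released benchmarking tool, and it assumes the input list contains the relevance labels of every judged document for the query, in ranked order.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering of the same labels."""
    # Simplification: the ideal ordering is built from the labels passed in,
    # so relevant documents missing from the list are not penalized.
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([1, 1, 0, 0])  # relevant documents ranked first
swapped = ndcg_at_k([0, 1])        # the only relevant document ranked second
print(perfect, swapped)
```

A production evaluation would take the ideal ordering from the full relevance judgments for each query and average the per-query scores.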

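Standard benchmarks and tools exist for evaluating dense embedding models, but we did not find an open, standardized tool for evaluating sparse embedding models. To support the community, we are releasing our benchmarking tool, which is built on BEIR and pyserini. All evaluation data was generated with this tool and should therefore be reproducible.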