Bert-MLM_arXiv-MP-class_zbMath Open-Source Model: Free Similarity Scoring for Short Mathematical Texts

Bert MLM Arxiv MP Class Zbmath

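Developed by math-similarity

A sentence-transformers model built for computing the similarity of short mathematical texts. It maps sentences and paragraphs to a 768-dimensional dense vector space.

Text Embedding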

Transformers

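#math-text-similarity #short-text-embedding #academic-paper-matching

Downloads: 415

Released: 5/18/2024

Model Overview

The model is designed for mathematical text and is particularly suited to measuring the semantic similarity of short texts such as paper abstracts and theorem statements. It can be used for tasks such as clustering or semantic search.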

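Model Features

Specialized for mathematical text

Optimized for the mathematical domain and designed to handle short texts that contain formulas and mathematical terminology.

High-dimensional semantic encoding

Maps text to a 768-dimensional dense vector space that captures semantic relationships.

Sentence-transformers compatible

Built on the sentence-transformers framework, so it is easy to integrate into existing NLP pipelines.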

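Model Capabilities

Mathematical text similarity computation

Semantic vector generation

Short-text clustering

Academic literature retrieval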

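Use Cases

Academic research

Similar-paper retrieval

Find papers in a mathematical literature database that are similar to a given abstract.

Aims to improve the precision of related-literature retrieval.

Theorem classification

Classify theorems automatically by the semantic similarity of their statements.

Supports building mathematical knowledge bases.

Educational technology

Exercise similarity matching

Match similar mathematics problems on education platforms.

Supports personalized learning recommendations.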

🚀 Bert-MLM_arXiv-MP-class_zbMath

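This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. It is built specifically for computing the similarity of short mathematical texts.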

🚀 Quick Start

The model can be used through either sentence-transformers or HuggingFace Transformers, as described below.

📦 Installation

To use the model through sentence-transformers, install the library with:

pip install -U sentence-transformers

💻 Usage Examples

Basic usage (Sentence-Transformers)

Once sentence-transformers is installed, you can use the model as follows:

from sentence_transformers import SentenceTransformer
sentences = ["In this paper we show how to compute the $\\Lambda_{\\alpha}$ norm, $\\alpha\\ge 0$, using the dyadic grid. This result is a consequence of the description of the Hardy spaces $H^p(R^N)$ in terms of dyadic and special atoms.",
             "We show that a determinant of Stirling cycle numbers counts unlabeled acyclic single-source automata. The proof involves a bijection from these automata to certain marked lattice paths and a sign-reversing involution to evaluate the determinant."]

model = SentenceTransformer('math-similarity/Bert-MLM_arXiv-MP-class_zbMath')
embeddings = model.encode(sentences)
print(embeddings)
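
To turn two embeddings into a similarity score, a common choice is cosine similarity, which is also the quantity the fine-tuning objective described later is built on. The following is a minimal sketch in plain Python; the helper name `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions, not part of the model card.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_u * norm_v)


# Toy vectors standing in for the model's 768-dimensional output.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u
w = [-3.0, 0.0, 1.0]  # orthogonal to u

score_uv = cosine_similarity(u, v)  # close to 1: same direction
score_uw = cosine_similarity(u, w)  # close to 0: orthogonal
```

With the real model, `cosine_similarity(embeddings[0], embeddings[1])` would score the two example sentences. sentence-transformers also provides `util.cos_sim` for the same purpose.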

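Advanced usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model as follows: first pass the input through the Transformer model, then apply the right pooling operation on top of the contextualized word embeddings.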

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["In this paper we show how to compute the $\\Lambda_{\\alpha}$ norm, $\\alpha\\ge 0$, using the dyadic grid. This result is a consequence of the description of the Hardy spaces $H^p(R^N)$ in terms of dyadic and special atoms.",
             "We show that a determinant of Stirling cycle numbers counts unlabeled acyclic single-source automata. The proof involves a bijection from these automata to certain marked lattice paths and a sign-reversing involution to evaluate the determinant."]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('math-similarity/Bert-MLM_arXiv-MP-class_zbMath')
model = AutoModel.from_pretrained('math-similarity/Bert-MLM_arXiv-MP-class_zbMath')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)

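📚 Documentation

Intended Use

The model is intended as a sentence and short-paragraph encoder for mathematical text. Given an input text, it outputs a vector that captures the semantic information. The sentence vector can be used for information retrieval, clustering, or sentence similarity tasks. By default, input text longer than 256 word pieces is truncated.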

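Training Procedure

Domain Adaptation

We use the domain-adapted math-similarity/Bert-MLM_arXiv model. For details on the domain adaptation procedure, see its model card.

Pooling

We add a mean pooling layer on top of the domain-adapted model.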
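
Mean pooling averages the token embeddings of a sentence while ignoring padded positions, which is what the `mean_pooling` function in the HuggingFace Transformers example does on tensors. Below is a toy plain-Python sketch of the same idea for a single sentence. The helper name `masked_mean_pool` and the 2-dimensional vectors are made up for illustration, and the all-padding guard uses a count of 1 rather than the `1e-9` clamp in the original function.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padded positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:  # mask value 1 marks a real token, 0 marks padding
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [s / count for s in summed]


# One sentence with three real tokens and one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]

pooled = masked_mean_pool(tokens, mask)  # the padding vector does not contribute
```

Without the mask, the padding vector would dominate the average, so sentences of different lengths in the same batch would get distorted embeddings.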

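Fine-Tuning

We fine-tune the model with a cosine similarity objective. Formally, it computes the vectors u = model(sentence_A) and v = model(sentence_B) and measures the cosine similarity between the two. By default, it minimizes the following loss: ||input_label - cos_score_transformation(cosine_sim(u,v))||_2, using mean squared error (MSE) as the loss function.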
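
As a rough illustration of this objective, the sketch below computes the mean squared error between labels and cosine scores for a small batch of vector pairs. It is plain Python rather than the actual training code; the function names are hypothetical, and `cos_score_transformation` is assumed to be the identity.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse_loss(pairs, labels, transform=lambda score: score):
    """MSE between each label and the (transformed) cosine score of its pair.

    `transform` plays the role of cos_score_transformation; the identity
    default is an assumption of this sketch.
    """
    errors = [
        (label - transform(cosine_sim(u, v))) ** 2
        for (u, v), label in zip(pairs, labels)
    ]
    return sum(errors) / len(errors)


pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction, cosine 1
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, cosine 0
]
labels = [1.0, 1.0]  # both pairs labeled as similar

loss = cosine_mse_loss(pairs, labels)
```

The first pair already matches its label and contributes no error; the second pair is labeled similar but has cosine 0, so training would push its two embeddings closer together.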

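We use title pairs from zbMath as the fine-tuning dataset and model semantic similarity through their MSC codes. Two titles are defined as similar if they share their primary MSC₅ and another secondary MSC₅; otherwise they are defined as semantically dissimilar. The training set contains 351,472 title pairs and the evaluation set contains 43,935 title pairs. See the training notebook for more information.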
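
The labeling rule can be sketched as follows. This is one reading of the rule (same primary code and at least one secondary code in common), not the authors' code: the `Record` class, the function name, the numeric label values 1.0 and 0.0, and the example codes are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """Hypothetical container for the MSC classification of one zbMath entry."""
    primary: str
    secondary: frozenset = field(default_factory=frozenset)


def similarity_label(a, b):
    """Return 1.0 if both records share the primary code and at least one
    secondary code, else 0.0 (label values are an assumption of this sketch)."""
    same_primary = a.primary == b.primary
    shared_secondary = bool(a.secondary & b.secondary)
    return 1.0 if same_primary and shared_secondary else 0.0


# Example classifications, chosen only for illustration.
x = Record("05A15", frozenset({"15A15", "68Q45"}))
y = Record("05A15", frozenset({"68Q45"}))
z = Record("42B30", frozenset({"68Q45"}))

label_xy = similarity_label(x, y)  # same primary, shared secondary
label_xz = similarity_label(x, z)  # shared secondary but different primary
```

Under this reading, a shared secondary code alone is not enough: `x` and `z` have one in common but differ in their primary code, so they count as dissimilar.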

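Due to licensing restrictions, we cannot provide a dataset containing the titles. We have, however, created a dataset that contains only the corresponding zbMath identifiers (also known as an) together with the primary and secondary MSC classifications, without the titles. It is available at datasets/math-similarity/class-zbmath-identifier.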