Bert-MLM_arXiv-MP-class_zbMath open-source model: compute the similarity of short mathematical texts for free

Bert MLM Arxiv MP Class Zbmath

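Developed by math-similarity

A sentence-transformers model built for computing the similarity of short mathematical texts. It maps sentences and paragraphs to a 768-dimensional dense vector space.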

Text Embedding

Transformers

#math-text-similarity #short-text-embedding #academic-paper-matching

Downloads: 415

Released: 5/18/2024

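Model Overview

The model is designed for text from the mathematical domain. It is particularly suited to computing the semantic similarity of short texts such as paper abstracts and theorem statements, and can be used for tasks such as clustering or semantic search.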

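Model Features

Specialized for mathematical text

Optimized for the mathematical domain and designed to handle short texts that contain formulas and technical terminology

768-dimensional semantic encoding

Maps text to a 768-dimensional dense vector space that captures semantic relationships

Sentence-transformers compatible

Built on the sentence-transformers framework, so it integrates easily into existing NLP pipelines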

Model Capabilities

Mathematical text similarity

Semantic vector generation

Short-text clustering

Academic literature retrieval

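Use Cases

Academic research

Similarity search over mathematical papers

Find papers in a mathematical literature database whose abstracts are similar to a given abstract

Intended to improve the accuracy of related-literature retrieval

Theorem classification

Classify theorems automatically by the semantic similarity of their statements

Supports building mathematical knowledge bases

Educational technology

Exercise similarity matching

Match similar math problems on an education platform

Supports personalized learning recommendations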

🚀 Bert-MLM_arXiv-MP-class_zbMath

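This is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search. The model is built specifically for computing the similarity of short mathematical texts.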

🚀 Quick Start

The model can be used through either sentence-transformers or HuggingFace Transformers. The steps for each are given below.

📦 Installation

To use the model through sentence-transformers, install the library with:

pip install -U sentence-transformers

💻 Usage Examples

Basic usage (Sentence-Transformers)

Once sentence-transformers is installed, you can use the model as follows:

from sentence_transformers import SentenceTransformer
sentences = ["In this paper we show how to compute the $\\Lambda_{\\alpha}$ norm, $\\alpha\\ge 0$, using the dyadic grid. This result is a consequence of the description of the Hardy spaces $H^p(R^N)$ in terms of dyadic and special atoms.",
             "We show that a determinant of Stirling cycle numbers counts unlabeled acyclic single-source automata. The proof involves a bijection from these automata to certain marked lattice paths and a sign-reversing involution to evaluate the determinant."]

model = SentenceTransformer('math-similarity/Bert-MLM_arXiv-MP-class_zbMath')
embeddings = model.encode(sentences)
print(embeddings)
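
Since the model targets similarity between short mathematical texts, a natural next step is to compare two embeddings by cosine similarity. The following is a minimal sketch in plain Python; the helper name `cosine_similarity` and the 3-dimensional stand-in vectors are illustrative assumptions, not part of the model card (real embeddings from this model have 768 dimensions).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


# Hypothetical stand-in vectors, used here so the sketch runs without downloading the model.
emb_a = [1.0, 0.0, 1.0]
emb_b = [1.0, 0.0, 0.0]
emb_c = [-1.0, 0.0, -1.0]

print(cosine_similarity(emb_a, emb_b))  # partially aligned vectors
print(cosine_similarity(emb_a, emb_c))  # vectors pointing in opposite directions
```

With the real model, the same function can be applied to rows of the array returned by `model.encode`, for example `cosine_similarity(embeddings[0], embeddings[1])`. The sentence-transformers library also ships `util.cos_sim`, which computes the same quantity for whole batches.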

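Advanced usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model as follows: first pass the input through the transformer model, then apply the correct pooling operation to the contextualized token embeddings.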

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["In this paper we show how to compute the $\\Lambda_{\\alpha}$ norm, $\\alpha\\ge 0$, using the dyadic grid. This result is a consequence of the description of the Hardy spaces $H^p(R^N)$ in terms of dyadic and special atoms.",
             "We show that a determinant of Stirling cycle numbers counts unlabeled acyclic single-source automata. The proof involves a bijection from these automata to certain marked lattice paths and a sign-reversing involution to evaluate the determinant."]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('math-similarity/Bert-MLM_arXiv-MP-class_zbMath')
model = AutoModel.from_pretrained('math-similarity/Bert-MLM_arXiv-MP-class_zbMath')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
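
To see why the `mean_pooling` function above takes the attention mask into account, the toy sketch below repeats the same computation for a single sentence in plain Python. The function name `masked_mean` and the 2-dimensional token vectors are assumptions made for illustration only.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padded positions."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:  # mask value 1 marks a real token, 0 marks padding
            for i, value in enumerate(vector):
                totals[i] += value
    # Same guard against division by zero as the torch.clamp call above.
    count = max(sum(attention_mask), 1e-9)
    return [total / count for total in totals]


# Toy sentence: two real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

An unmasked average over all three positions would be pulled toward the padding vector, which is why padded batches need the mask for the sentence embedding to depend only on real tokens.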

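📚 Documentation

Intended Use

The model is intended to be used as an encoder for sentences and short paragraphs of mathematical text. Given an input text, it outputs a vector that captures the semantic information. This sentence vector can be used for information retrieval, clustering, or sentence similarity tasks. By default, input text longer than 256 word pieces is truncated.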

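Training Procedure

Domain adaptation

We use the domain-adapted math-similarity/Bert-MLM_arXiv model. See that model's card for more details on the domain-adaptation procedure.

Pooling

We add a mean-pooling layer on top of the domain-adapted model.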

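Fine-tuning

We fine-tune the model with a cosine-similarity objective. Formally, it computes the vectors u = model(sentence_A) and v = model(sentence_B) and measures the cosine similarity between them. By default, it minimizes the following loss: ||input_label - cos_score_transformation(cosine_sim(u,v))||_2, using mean squared error (MSE) as the loss function.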
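
The objective above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the training code: `cos_score_transformation` is taken to be the identity, the helper names `_cos` and `cosine_mse_loss` are hypothetical, and the 2-dimensional vectors stand in for the 768-dimensional outputs u and v.

```python
import math


def _cos(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_mse_loss(pairs, labels):
    """Mean squared error between target labels and cosine similarities.

    Assumes cos_score_transformation is the identity.
    """
    errors = [(label - _cos(u, v)) ** 2 for (u, v), label in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Hypothetical batch: one pair of identical vectors and one pair of orthogonal vectors.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]

print(cosine_mse_loss(pairs, [1.0, 0.0]))  # labels agree with the cosine scores
print(cosine_mse_loss(pairs, [0.0, 1.0]))  # labels contradict the cosine scores
```

The loss is zero when each label equals the cosine similarity of its pair and grows as the two diverge, which pushes embeddings of similar texts together and embeddings of dissimilar texts apart.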

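We use title pairs from zbMath as the fine-tuning dataset and model semantic similarity through their MSC codes. Two titles are defined as similar if they share their primary MSC₅ and another secondary MSC₅; otherwise they are defined as semantically dissimilar. The training set contains 351,472 title pairs and the evaluation set contains 43,935 title pairs. See the training notebook for more information.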
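
The labeling rule can be read as a small predicate over MSC codes. The sketch below is one possible reading, not the authors' preprocessing code: it assumes that "similar" means the two records have the same primary code and at least one secondary code in common, and that similar and dissimilar pairs are labeled 1.0 and 0.0. The function name, argument layout, and example codes are hypothetical.

```python
def msc5_pair_label(primary_a, secondary_a, primary_b, secondary_b):
    """Return 1.0 if two records count as similar under the assumed rule, else 0.0."""
    same_primary = primary_a == primary_b
    shared_secondary = bool(set(secondary_a) & set(secondary_b))
    return 1.0 if same_primary and shared_secondary else 0.0


# Hypothetical records: (primary MSC code, list of secondary MSC codes).
record_1 = ("05A15", ["11B73", "68Q45"])
record_2 = ("05A15", ["68Q45"])
record_3 = ("42B30", ["68Q45"])

print(msc5_pair_label(*record_1, *record_2))  # same primary, shared secondary
print(msc5_pair_label(*record_1, *record_3))  # different primary
```

Requiring agreement on both the primary and a secondary code makes the positive class stricter than matching on the primary code alone.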

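Because of licensing restrictions, we cannot provide the dataset that includes the titles. We have instead created a dataset that contains only the corresponding zbMath identifiers (also known as an) together with the primary and secondary MSC classifications, without the titles. It is available at datasets/math-similarity/class-zbmath-identifier.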