MathBERT-custom open-source model: focused on mathematical language understanding, free to use, supports English mathematical text processing

Mathbert Custom

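Developed by tbs17

A BERT model pretrained on English mathematical text, focused on mathematical language understanding tasks

Large Language Model

Transformers

#MathLanguageUnderstanding #EducationDomain #BidirectionalContextModeling

Downloads: 214

Released: 3/2/2022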

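Model Overview

A Transformer model pretrained in a self-supervised way on a large mathematical corpus. It supports masked language modeling and next sentence prediction, and is tuned specifically for processing mathematics-related text.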

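Model Features

Math-domain optimization

Trained specifically on mathematical text, covering mathematical language from pre-kindergarten through graduate level

Custom vocabulary

Uses a custom vocabulary of 30,522 tokens to better handle mathematical terms

Bidirectional context understanding

Learns bidirectional sentence representations through the MLM objective

Uncased

Treats upper-case and lower-case variants identically, which improves robustness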
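
To make the "custom vocabulary" and "uncased" features concrete, here is a minimal sketch of BERT-style WordPiece tokenization (greedy longest-match-first). It assumes the model uses a standard BERT WordPiece tokenizer; the toy vocabulary and the function name `wordpiece_tokenize` are hypothetical, and only lowercasing is shown (a real uncased tokenizer also handles punctuation and accents).

```python
# Hypothetical toy vocabulary; the real one has 30,522 entries.
TOY_VOCAB = {"decimal", "hundred", "##th", "##s", "place"}


def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPiece tokens using greedy longest-match-first."""
    word = word.lower()  # uncased: "Decimals" and "decimals" tokenize identically
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


print(wordpiece_tokenize("Decimals", TOY_VOCAB))    # ['decimal', '##s']
print(wordpiece_tokenize("Hundredths", TOY_VOCAB))  # ['hundred', '##th', '##s']
```

A vocabulary built from math text can keep more domain terms as single tokens, so fewer of them are broken into `##` continuation pieces.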

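Model Capabilities

Feature extraction from mathematical text

Math problem understanding

Mathematical term prediction

Relationship judgment between mathematical sentences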

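Use Cases

Educational technology

Math question answering systems

Serves as the feature extraction module of a math question answering system

Performs better than general-purpose models on mask-filling tasks over math problem text

Math textbook analysis

Analyzes the content structure of math textbooks

Academic research

Math paper processing

Processes abstracts of arXiv mathematics papers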

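🚀 MathBERT model (custom vocabulary)

MathBERT is a model pretrained on English mathematical language data spanning kindergarten through graduate level, using a masked language modeling (MLM) objective. The model is uncased: it makes no distinction between "english" and "English".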

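✨ Key Features

MathBERT is a Transformer model pretrained in a self-supervised fashion on a large corpus of English mathematical text. It was pretrained on raw text only, with no human labeling; an automatic process generates the inputs and labels from the text. Specifically, it was pretrained with two objectives:

Masked language modeling (MLM): given a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This differs from traditional recurrent neural networks (RNNs), which usually process words one after another, and from autoregressive models such as GPT, which internally mask future tokens. It allows the model to learn a bidirectional representation of the sentence.
Next sentence prediction (NSP): during pretraining, the model concatenates two masked sentences as input. Sometimes the two sentences are adjacent in the original text, sometimes they are not. The model has to predict whether the two sentences were adjacent.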
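
The two objectives can be sketched in plain Python. This is an illustrative simplification, not the author's preprocessing code: it splits on whitespace instead of using WordPiece, always substitutes `[MASK]` (the original BERT recipe replaces only most of the selected tokens with `[MASK]`), and the function names `mask_tokens` and `make_nsp_pair` are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] is the original token at masked positions and None elsewhere,
    so a loss would only be computed where a token was hidden.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask about 15% of positions, and at least one so short inputs still train.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


def make_nsp_pair(sentences, index, rng):
    """Return (first, second, is_next) for the sentence at `index`.

    Half of the time `second` is the true next sentence (is_next=True);
    otherwise it is a random sentence that is neither `first` nor its successor.
    """
    if len(sentences) < 3 or index + 1 >= len(sentences):
        raise ValueError("need at least 3 sentences and a successor for `index`")
    first = sentences[index]
    if rng.random() < 0.5:
        return first, sentences[index + 1], True
    candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
    return first, rng.choice(candidates), False
```

For example, `mask_tokens("perform decimal operations through the hundredths place".split())` hides one of the seven words and records it in `labels`, which is the target the model is trained to recover.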

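In this way, the model learns an internal representation of mathematical language that can be used to extract features useful for downstream tasks. For example, given a dataset of labeled sentences, you can train a standard classifier using the features produced by MathBERT as inputs.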
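
The model returns one vector per token, so a classifier over whole sentences needs a single fixed-size vector per sentence. The page does not say how to obtain one; masked mean pooling is one common choice and is an assumption here. In the sketch below, plain Python lists stand in for the model's token-level output and attention mask, and `mean_pool` is a hypothetical helper name.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy input: three 2-dimensional token vectors, the last one being padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
print(mean_pool(token_vectors, attention_mask))  # [2.0, 3.0]
```

The pooled vectors, one per labeled sentence, can then serve as the input features for any standard classifier.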

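💻 Usage Examples

Basic usage

Here is how to use this model in PyTorch to get the features of a given text:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT-custom')
model = BertModel.from_pretrained("tbs17/MathBERT-custom")
text = "Replace me by any text you'd like."
# Keep the full encoding (input_ids, token_type_ids, attention_mask), not just input_ids
encoded_input = tokenizer(text, return_tensors='pt')
# output.last_hidden_state holds one feature vector per input token
output = model(**encoded_input)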

Advanced usage

Here is how to use this model in TensorFlow to get the features of a given text:

from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT-custom')
model = TFBertModel.from_pretrained("tbs17/MathBERT-custom")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

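📚 Detailed Documentation

Intended uses and limitations

You can use the raw model for masked language modeling or next sentence prediction, but it is mainly intended to be fine-tuned on math-related downstream tasks.

Note that the model is primarily aimed at being fine-tuned on math-related tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For tasks such as mathematical text generation, a model like GPT-2 is a better choice.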
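
As a starting point for the sequence classification case, the sketch below loads the checkpoint with a classification head using the Transformers class `BertForSequenceClassification`. The wrapper function `load_for_sequence_classification` and the default of two labels are illustrative assumptions, not something the model card prescribes; calling it downloads the checkpoint.

```python
MODEL_ID = "tbs17/MathBERT-custom"  # model id used in the usage examples on this page


def load_for_sequence_classification(num_labels=2, model_id=MODEL_ID):
    """Load the tokenizer and the pretrained encoder with a new classification head."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_id)
    # The encoder weights come from the checkpoint; the classification head is
    # newly initialised and has to be trained on labeled data.
    model = BertForSequenceClassification.from_pretrained(model_id, num_labels=num_labels)
    return tokenizer, model
```

Calling `tokenizer, model = load_for_sequence_classification(num_labels=2)` gives a model whose predictions are not meaningful until the head has been fine-tuned on a labeled math dataset.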

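Warning

MathBERT is designed specifically for math-related tasks. It performs better at mask filling on math problem text than at general-purpose mask filling. For example: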

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='tbs17/MathBERT')
# Intended usage
>>> unmasker("students apply these new understandings as they reason about and perform decimal [MASK] through the hundredths place.")

[{'score': 0.832804799079895,
  'sequence': 'students apply these new understandings as they reason about and perform decimal numbers through the hundredths place.',
  'token': 3616,
  'token_str': 'numbers'},
 {'score': 0.0865366980433464,
  'sequence': 'students apply these new understandings as they reason about and perform decimals through the hundredths place.',
  'token': 2015,
  'token_str': '##s'},
 {'score': 0.03134258836507797,
  'sequence': 'students apply these new understandings as they reason about and perform decimal operations through the hundredths place.',
  'token': 3136,
  'token_str': 'operations'},
 {'score': 0.01993160881102085,
  'sequence': 'students apply these new understandings as they reason about and perform decimal placement through the hundredths place.',
  'token': 11073,
  'token_str': 'placement'},
 {'score': 0.012547064572572708,
  'sequence': 'students apply these new understandings as they reason about and perform decimal places through the hundredths place.',
  'token': 3182,
  'token_str': 'places'}]

# Not the intended usage
>>> unmasker("The man worked as a [MASK].")

[{'score': 0.6469377875328064,
  'sequence': 'the man worked as a book.',
  'token': 2338,
  'token_str': 'book'},
 {'score': 0.07073448598384857,
  'sequence': 'the man worked as a guide.',
  'token': 5009,
  'token_str': 'guide'},
 {'score': 0.031362924724817276,
  'sequence': 'the man worked as a text.',
  'token': 3793,
  'token_str': 'text'},
 {'score': 0.02306508645415306,
  'sequence': 'the man worked as a man.',
  'token': 2158,
  'token_str': 'man'},
 {'score': 0.020547250285744667,
  'sequence': 'the man worked as a distance.',
  'token': 3292,
  'token_str': 'distance'}]
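
The fill-mask pipeline returns a list of dictionaries, so picking usable candidates is ordinary list processing. The sketch below keeps only whole-word predictions above a score threshold; the helper name `whole_word_candidates` and the 0.02 threshold are illustrative assumptions, and the data is copied from the math example above with the `sequence` and `token` fields omitted.

```python
# Top-5 predictions copied from the math example above.
MATH_RESULTS = [
    {"score": 0.832804799079895, "token_str": "numbers"},
    {"score": 0.0865366980433464, "token_str": "##s"},
    {"score": 0.03134258836507797, "token_str": "operations"},
    {"score": 0.01993160881102085, "token_str": "placement"},
    {"score": 0.012547064572572708, "token_str": "places"},
]


def whole_word_candidates(results, min_score=0.02):
    """Return predicted tokens that are whole words and score at least min_score."""
    return [
        r["token_str"]
        for r in results
        # "##" marks a WordPiece continuation, i.e. a suffix rather than a full word.
        if not r["token_str"].startswith("##") and r["score"] >= min_score
    ]


print(whole_word_candidates(MATH_RESULTS))  # ['numbers', 'operations']
```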