🚀 MathBERT model (custom vocab)
A model pretrained on pre-K through graduate-level math language (English) using a masked language modeling (MLM) objective.
🚀 Quick Start
The MathBERT model is pretrained on a large English math corpus in a self-supervised way. It can be used for masked language modeling or next sentence prediction out of the box, but it is mainly designed to be fine-tuned on math-related downstream tasks.
✨ Features
- Self-Supervised Learning: Trained on raw math texts without human labeling, using masked language modeling (MLM) and next sentence prediction (NSP) objectives.
- Bidirectional Representation: Learns a bidirectional understanding of math language, unlike traditional RNNs and autoregressive models.
- Feature Extraction: Can extract useful features for downstream math-related tasks.
📚 Documentation
Model Description
MathBERT is a transformers model pretrained on a large corpus of English math data in a self-supervised manner. It was pretrained with two main objectives:
- Masked Language Modeling (MLM): Randomly masks 15% of the words in an input sentence and predicts the masked words. This helps the model learn a bidirectional representation of the sentence.
- Next Sentence Prediction (NSP): Concatenates two masked sentences during pretraining and predicts if they were consecutive in the original text.
The model learns an internal representation of math language, which can be used to extract features for downstream tasks. For example, you can train a classifier using the features generated by MathBERT.
Intended Uses & Limitations
- Intended Uses: The raw model can be used for masked language modeling or next sentence prediction, but it is mainly intended to be fine-tuned on math-related downstream tasks such as sequence classification, token classification, or question answering (see the fine-tuning sketch after this list).
- Limitations: It is not suitable for tasks such as math text generation; for those, autoregressive models like GPT2 are recommended.
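As a rough illustration of the fine-tuning route, the sketch below loads the checkpoint into a sequence-classification head. The two-label setup, the toy batch, and the labels are placeholders for illustration only; they are not part of the released model.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT-custom')
# num_labels=2 is a hypothetical binary task (e.g. math vs. non-math text)
model = BertForSequenceClassification.from_pretrained('tbs17/MathBERT-custom', num_labels=2)

# Toy batch with made-up labels
texts = ["Add the fractions after finding a common denominator.", "The weather is nice today."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch, labels=labels)

# outputs.loss is the cross-entropy loss, outputs.logits are the class scores
outputs.loss.backward()  # plug this into your own optimizer/training loop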
How to use
You can use this model to get the features of a given text in PyTorch and TensorFlow:
PyTorch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT-custom')
model = BertModel.from_pretrained('tbs17/MathBERT-custom')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)  # output.last_hidden_state contains the token-level features
TensorFlow
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT-custom')
model = TFBertModel.from_pretrained('tbs17/MathBERT-custom')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)  # output.last_hidden_state contains the token-level features
⚠️ Important Note
MathBERT is specifically designed for mathematics-related tasks. It performs better on mathematical problem-text fill-mask tasks than on general-purpose fill-mask tasks, as the two examples below show:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='tbs17/MathBERT')
>>> unmasker("students apply these new understandings as they reason about and perform decimal [MASK] through the hundredths place.")
[{'score': 0.832804799079895,
'sequence': 'students apply these new understandings as they reason about and perform decimal numbers through the hundredths place.',
'token': 3616,
'token_str': 'numbers'},
{'score': 0.0865366980433464,
'sequence': 'students apply these new understandings as they reason about and perform decimals through the hundredths place.',
'token': 2015,
'token_str': '##s'},
{'score': 0.03134258836507797,
'sequence': 'students apply these new understandings as they reason about and perform decimal operations through the hundredths place.',
'token': 3136,
'token_str': 'operations'},
{'score': 0.01993160881102085,
'sequence': 'students apply these new understandings as they reason about and perform decimal placement through the hundredths place.',
'token': 11073,
'token_str': 'placement'},
{'score': 0.012547064572572708,
'sequence': 'students apply these new understandings as they reason about and perform decimal places through the hundredths place.',
'token': 3182,
'token_str': 'places'}]
>>> unmasker("The man worked as a [MASK].")
[{'score': 0.6469377875328064,
'sequence': 'the man worked as a book.',
'token': 2338,
'token_str': 'book'},
{'score': 0.07073448598384857,
'sequence': 'the man worked as a guide.',
'token': 5009,
'token_str': 'guide'},
{'score': 0.031362924724817276,
'sequence': 'the man worked as a text.',
'token': 3793,
'token_str': 'text'},
{'score': 0.02306508645415306,
'sequence': 'the man worked as a man.',
'token': 2158,
'token_str': 'man'},
{'score': 0.020547250285744667,
'sequence': 'the man worked as a distance.',
'token': 3292,
'token_str': 'distance'}]
Training data
| Property | Details |
|----------|---------|
| Model Type | MathBERT (custom vocab) |
| Training Data | Pre-K to high-school math curricula (engageNY, Utah Math, Illustrative Math), college math books from openculture.com, and graduate-level math from arXiv math paper abstracts. Approximately 100M tokens were used for pretraining. |
Training procedure
The texts are lowercased and tokenized using WordPiece with a customized vocabulary size of 30,522. The bert_tokenizer from the Hugging Face tokenizers library is used to generate a custom vocab file from the raw math training texts.
The model inputs are in the form:
[CLS] Sentence A [SEP] Sentence B [SEP]
With a probability of 0.5, sentence A and sentence B are consecutive in the original corpus; otherwise, sentence B is a random sentence from the corpus. The combined length of the two "sentences" must be less than 512 tokens.
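For reference, this [CLS] Sentence A [SEP] Sentence B [SEP] layout is exactly what a BERT tokenizer produces for a sentence pair. The snippet below only illustrates that input format with a made-up pair of sentences; it is not the original pretraining pipeline.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT-custom')

# Encode a made-up sentence pair the way BERT-style pretraining inputs are laid out
sentence_a = "a fraction names part of a whole."
sentence_b = "the denominator tells how many equal parts the whole is divided into."
encoded = tokenizer(sentence_a, sentence_b)

print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# -> ['[CLS]', ..., '[SEP]', ..., '[SEP]'], i.e. [CLS] Sentence A [SEP] Sentence B [SEP]
print(encoded['token_type_ids'])  # 0 for sentence A (and its [CLS]/[SEP]), 1 for sentence B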
The masking procedure for each sentence (a minimal sketch follows the list):
- 15% of the tokens are masked.
- In 80% of cases, masked tokens are replaced by [MASK].
- In 10% of cases, masked tokens are replaced by a random token.
- In the remaining 10% of cases, masked tokens remain unchanged.
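This 80/10/10 split is the standard BERT masking recipe. The following is a minimal sketch of that procedure, assuming a Hugging Face tokenizer; it mirrors the rules above but is not the original pretraining code.
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT-custom')

def mask_tokens(input_ids, mlm_probability=0.15):
    # Select 15% of the (non-special) tokens; of those, 80% become [MASK],
    # 10% become a random token, and 10% are left unchanged.
    labels = input_ids.clone()
    probability_matrix = torch.full(labels.shape, mlm_probability)
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(labels.tolist(), already_has_special_tokens=True),
        dtype=torch.bool,
    )
    probability_matrix.masked_fill_(special, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # loss is computed only on masked positions

    # 80% of the selected positions -> [MASK]
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    input_ids[indices_replaced] = tokenizer.mask_token_id

    # 10% of the selected positions -> random token (half of the remaining 20%)
    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    random_tokens = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    input_ids[indices_random] = random_tokens[indices_random]

    # the remaining 10% stay unchanged
    return input_ids, labels

ids = torch.tensor(tokenizer("perform decimal operations through the hundredths place.")['input_ids'])
masked_ids, labels = mask_tokens(ids)
print(tokenizer.convert_ids_to_tokens(masked_ids.tolist()))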
Pretraining
The model was trained on 8-core cloud TPUs from Google Colab for 600k steps with a batch size of 128. The sequence length was limited to 512 tokens throughout training. The optimizer was Adam with a learning rate of 5e-5, β₁ = 0.9, β₂ = 0.999, a weight decay of 0.01, learning rate warm-up for 10,000 steps, and linear decay of the learning rate afterward.
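For readers who want to mirror these hyperparameters in PyTorch, the sketch below sets up a comparable optimizer and schedule using the transformers helpers. It approximates the configuration described above (using AdamW as the weight-decay variant of Adam); it is not the original TPU training script.
import torch
from transformers import BertForMaskedLM, get_linear_schedule_with_warmup

model = BertForMaskedLM.from_pretrained('tbs17/MathBERT-custom')

# Hyperparameters quoted above
total_steps = 600_000
warmup_steps = 10_000

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-5,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)

# Per batch of 128 sequences, a training loop would call:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()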