🚀 MathBERT model (custom vocab)
A model pretrained on pre-K through graduate-level math language (English) using a masked language modeling (MLM) objective.
🚀 Quick Start
The MathBERT model is pretrained on a large English math corpus in a self-supervised way. It can be used for masked language modeling or next sentence prediction out of the box, but it is mainly designed to be fine-tuned on math-related downstream tasks.
✨ Features
- Self-Supervised Learning: Trained on raw math texts without human labeling, using masked language modeling (MLM) and next sentence prediction (NSP) objectives.
- Bidirectional Representation: Learns a bidirectional understanding of math language, unlike traditional RNNs and autoregressive models.
- Feature Extraction: Can extract useful features for downstream math-related tasks.
📚 Documentation
Model Description
MathBERT is a transformers model pretrained on a large corpus of English math data in a self-supervised manner. It was pretrained with two main objectives:
- Masked Language Modeling (MLM): Randomly masks 15% of the words in an input sentence and predicts the masked words. This helps the model learn a bidirectional representation of the sentence.
- Next Sentence Prediction (NSP): Concatenates two masked sentences during pretraining and predicts if they were consecutive in the original text.
The model learns an internal representation of math language, which can be used to extract features for downstream tasks. For example, you can train a classifier using the features generated by MathBERT.
Intended Uses & Limitations
- Intended Uses: The raw model can be used for masked language modeling or next sentence prediction, but it is mainly intended to be fine-tuned on math-related downstream tasks such as sequence classification, token classification, or question answering (see the fine-tuning sketch after this list).
- Limitations: It is not suitable for tasks such as math text generation; for those, autoregressive models like GPT2 are recommended.
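As a rough illustration of the fine-tuning route, the sketch below loads the checkpoint into a sequence-classification head. The two-label setup, the toy batch, and the labels are placeholders for illustration only; they are not part of the released model.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT-custom')
# num_labels=2 is a hypothetical binary task (e.g. math vs. non-math text)
model = BertForSequenceClassification.from_pretrained('tbs17/MathBERT-custom', num_labels=2)

# Toy batch with made-up labels
texts = ["Add the fractions after finding a common denominator.", "The weather is nice today."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch, labels=labels)

# outputs.loss is the cross-entropy loss, outputs.logits are the class scores
outputs.loss.backward()  # plug this into your own optimizer/training loop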
How to use
You can use this model to get the features of a given text in PyTorch and TensorFlow:
PyTorch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT-custom')
model = BertModel.from_pretrained('tbs17/MathBERT-custom')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)  # output.last_hidden_state contains the token-level features
TensorFlow
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT-custom')
model = TFBertModel.from_pretrained('tbs17/MathBERT-custom')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)  # output.last_hidden_state contains the token-level features
⚠️ Important Note
MathBERT is specifically designed for mathematics-related tasks. It performs better on mathematical problem-text fill-mask tasks than on general-purpose fill-mask tasks, as the two examples below show:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='tbs17/MathBERT')
>>> unmasker("students apply these new understandings as they reason about and perform decimal [MASK] through the hundredths place.")
[{'score': 0.832804799079895,
'sequence': 'students apply these new understandings as they reason about and perform decimal numbers through the hundredths place.',
'token': 3616,
'token_str': 'numbers'},
{'score': 0.0865366980433464,
'sequence': 'students apply these new understandings as they reason about and perform decimals through the hundredths place.',
'token': 2015,
'token_str': '##s'},
{'score': 0.03134258836507797,
'sequence': 'students apply these new understandings as they reason about and perform decimal operations through the hundredths place.',
'token': 3136,
'token_str': 'operations'},
{'score': 0.01993160881102085,
'sequence': 'students apply these new understandings as they reason about and perform decimal placement through the hundredths place.',
'token': 11073,
'token_str': 'placement'},
{'score': 0.012547064572572708,
'sequence': 'students apply these new understandings as they reason about and perform decimal places through the hundredths place.',
'token': 3182,
'token_str': 'places'}]
>>> unmasker("The man worked as a [MASK].")
[{'score': 0.6469377875328064,
'sequence': 'the man worked as a book.',
'token': 2338,
'token_str': 'book'},
{'score': 0.07073448598384857,
'sequence': 'the man worked as a guide.',
'token': 5009,
'token_str': 'guide'},
{'score': 0.031362924724817276,
'sequence': 'the man worked as a text.',
'token': 3793,
'token_str': 'text'},
{'score': 0.02306508645415306,
'sequence': 'the man worked as a man.',
'token': 2158,
'token_str': 'man'},
{'score': 0.020547250285744667,
'sequence': 'the man worked as a distance.',
'token': 3292,
'token_str': 'distance'}]
Training data
| Property | Details |
|----------|---------|
| Model Type | MathBERT (custom vocab) |
| Training Data | Pre-K to high-school math curricula (engageNY, Utah Math, Illustrative Math), college math books from openculture.com, and graduate-level math from arXiv math paper abstracts. Approximately 100M tokens were used for pretraining. |
Training procedure
The texts are lowercased and tokenized using WordPiece with a customized vocabulary size of 30,522. The bert_tokenizer from the Hugging Face tokenizers library is used to generate a custom vocab file from the raw math training texts.
The model inputs are in the form:
[CLS] Sentence A [SEP] Sentence B [SEP]
With a probability of 0.5, sentence A and sentence B are consecutive in the original corpus; otherwise, sentence B is a random sentence from the corpus. The combined length of the two "sentences" must be less than 512 tokens.
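For reference, this [CLS] Sentence A [SEP] Sentence B [SEP] layout is exactly what a BERT tokenizer produces for a sentence pair. The snippet below only illustrates that input format with a made-up pair of sentences; it is not the original pretraining pipeline.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT-custom')

# Encode a made-up sentence pair the way BERT-style pretraining inputs are laid out
sentence_a = "a fraction names part of a whole."
sentence_b = "the denominator tells how many equal parts the whole is divided into."
encoded = tokenizer(sentence_a, sentence_b)

print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# -> ['[CLS]', ..., '[SEP]', ..., '[SEP]'], i.e. [CLS] Sentence A [SEP] Sentence B [SEP]
print(encoded['token_type_ids'])  # 0 for sentence A (and its [CLS]/[SEP]), 1 for sentence B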
The masking procedure for each sentence (a minimal sketch follows the list):
- 15% of the tokens are masked.
- In 80% of cases, masked tokens are replaced by [MASK].
- In 10% of cases, masked tokens are replaced by a random token.
- In the remaining 10% of cases, masked tokens remain unchanged.
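This 80/10/10 split is the standard BERT masking recipe. The following is a minimal sketch of that procedure, assuming a Hugging Face tokenizer; it mirrors the rules above but is not the original pretraining code.
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT-custom')

def mask_tokens(input_ids, mlm_probability=0.15):
    # Select 15% of the (non-special) tokens; of those, 80% become [MASK],
    # 10% become a random token, and 10% are left unchanged.
    labels = input_ids.clone()
    probability_matrix = torch.full(labels.shape, mlm_probability)
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(labels.tolist(), already_has_special_tokens=True),
        dtype=torch.bool,
    )
    probability_matrix.masked_fill_(special, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # loss is computed only on masked positions

    # 80% of the selected positions -> [MASK]
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    input_ids[indices_replaced] = tokenizer.mask_token_id

    # 10% of the selected positions -> random token (half of the remaining 20%)
    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    random_tokens = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    input_ids[indices_random] = random_tokens[indices_random]

    # the remaining 10% stay unchanged
    return input_ids, labels

ids = torch.tensor(tokenizer("perform decimal operations through the hundredths place.")['input_ids'])
masked_ids, labels = mask_tokens(ids)
print(tokenizer.convert_ids_to_tokens(masked_ids.tolist()))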
Pretraining
The model was trained on 8-core cloud TPUs from Google Colab for 600k steps with a batch size of 128. The sequence length was limited to 512 tokens throughout training. The optimizer was Adam with a learning rate of 5e-5, β₁ = 0.9, β₂ = 0.999, a weight decay of 0.01, learning rate warm-up for 10,000 steps, and linear decay of the learning rate afterward.
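For readers who want to mirror these hyperparameters in PyTorch, the sketch below sets up a comparable optimizer and schedule using the transformers helpers. It approximates the configuration described above (using AdamW as the weight-decay variant of Adam); it is not the original TPU training script.
import torch
from transformers import BertForMaskedLM, get_linear_schedule_with_warmup

model = BertForMaskedLM.from_pretrained('tbs17/MathBERT-custom')

# Hyperparameters quoted above
total_steps = 600_000
warmup_steps = 10_000

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-5,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)

# Per batch of 128 sequences, a training loop would call:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()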