BatteryOnlyBERT-cased model
A model pretrained on a large corpus of battery research papers with a masked language modeling (MLM) objective, learning bidirectional sentence representations for downstream tasks.
Quick Start
The BatteryOnlyBERT-cased model is a transformers model pretrained on a large corpus of battery research papers. You can use it directly for masked language modeling or fine-tune it for downstream tasks.
Here is an example of using it with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='batterydata/batteryonlybert-cased')
>>> unmasker("Hello I'm a [MASK] model.")
Features
- Bidirectional Representation: Pretrained with the masked language modeling (MLM) objective, allowing it to learn a bidirectional representation of the sentence.
- Case-Sensitive: It distinguishes between words that differ only in capitalization, such as "english" and "English".
- Useful for Downstream Tasks: Can be used to extract features for downstream tasks such as sequence classification, token classification, or question answering.
Installation
The model is used through the Hugging Face Transformers library, which can be installed with pip install transformers; the usage examples below additionally require PyTorch or TensorFlow.
Usage Examples
Basic Usage
from transformers import pipeline

# Load a fill-mask pipeline with the BatteryOnlyBERT-cased checkpoint
unmasker = pipeline('fill-mask', model='batterydata/batteryonlybert-cased')
result = unmasker("Hello I'm a [MASK] model.")
print(result)
Advanced Usage
from transformers import BertTokenizer, BertModel

# Extract contextual features with the PyTorch backend
tokenizer = BertTokenizer.from_pretrained('batterydata/batteryonlybert-cased')
model = BertModel.from_pretrained('batterydata/batteryonlybert-cased')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
print(output)
from transformers import BertTokenizer, TFBertModel

# The same feature extraction with the TensorFlow backend
tokenizer = BertTokenizer.from_pretrained('batterydata/batteryonlybert-cased')
model = TFBertModel.from_pretrained('batterydata/batteryonlybert-cased')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
print(output)
Documentation
Model description
BatteryOnlyBERT is a transformers model pretrained on a large corpus of battery research papers in a self-supervised fashion, using the masked language modeling (MLM) objective. The model randomly masks 15% of the words in the input sentence and then predicts the masked words, which allows it to learn a bidirectional representation of the sentence. This inner representation of the English language can then be used to extract features for downstream tasks.
Training data
The BatteryOnlyBERT model was pretrained on the full text of battery papers only. The paper corpus contains 1.87B tokens from a total of 400,366 battery research papers published from 2000 to June 2021 by the Royal Society of Chemistry (RSC), Elsevier, and Springer. The list of DOIs can be found on GitHub.
Training procedure
Preprocessing
The texts are tokenized using WordPiece with a vocabulary size of 28,996. The model inputs are of the form [CLS] Sentence A [SEP] Sentence B [SEP]. The masking procedure for each sentence is as follows (a sketch using the transformers data collator follows this list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by [MASK].
- In 10% of the cases, the masked tokens are replaced by a random token.
- In the remaining 10% of the cases, the masked tokens are left as is.
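This 15%/80%/10%/10% scheme matches the default dynamic masking implemented by DataCollatorForLanguageModeling in the transformers library. The sketch below only illustrates that behaviour on an invented sentence pair; it is not the authors' original preprocessing code.

from transformers import BertTokenizer, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained('batterydata/batteryonlybert-cased')

# 15% of tokens are selected; of those, 80% become [MASK], 10% become a random
# token, and 10% are left unchanged (the collator's default behaviour).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Sentence-pair input in the [CLS] Sentence A [SEP] Sentence B [SEP] format.
encoded = tokenizer("Lithium-ion batteries degrade over time.",
                    "Electrolyte additives can slow this degradation.")
batch = collator([encoded])
print(tokenizer.decode(batch['input_ids'][0]))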
Pretraining
The model was trained on 8 NVIDIA DGX A100 GPUs for 1,500,000 steps with a batch size of 256. The sequence length was limited to 512 tokens. The optimizer used is Adam with a learning rate of 1e-4, \(\beta_{1}=0.9\) and \(\beta_{2}=0.999\), a weight decay of 0.01, learning rate warm-up for 10,000 steps, and linear decay of the learning rate afterwards.
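As an illustration of these hyperparameters only (this is not the authors' training script, and using AdamW for the decoupled weight decay is an assumption), the optimizer and schedule could be configured in transformers as follows:

import torch
from transformers import BertForMaskedLM, get_linear_schedule_with_warmup

model = BertForMaskedLM.from_pretrained('batterydata/batteryonlybert-cased')

# Adam with lr=1e-4, beta1=0.9, beta2=0.999 and weight decay 0.01
# (AdamW is assumed here as the weight-decay variant).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=0.01)

# 10,000 warm-up steps followed by linear decay over 1,500,000 total steps.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=10_000,
                                            num_training_steps=1_500_000)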
Intended uses & limitations
You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task. The model is primarily aimed at tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For text generation tasks, you should look at models like GPT2.
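A minimal fine-tuning sketch for sequence classification is shown below; the two toy sentences, the binary label set, and the learning rate are placeholders for illustration, not a recommended setup.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('batterydata/batteryonlybert-cased')
# num_labels=2 assumes a binary task (e.g. battery-relevant vs. not relevant).
model = BertForSequenceClassification.from_pretrained(
    'batterydata/batteryonlybert-cased', num_labels=2)

# Toy labeled examples; in practice you would fine-tune on a real annotated dataset.
texts = ["The cathode capacity fades after 500 cycles.",
         "The weather was pleasant during the conference."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One training step: a forward pass with labels returns the classification loss.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))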
Evaluation results
Final loss: 1.0614.
Technical Details
The model uses the masked language modeling (MLM) objective to learn bidirectional sentence representations: it randomly masks tokens in the input sentence and predicts them from both the left and the right context, unlike traditional RNNs and autoregressive models such as GPT, which only condition on preceding tokens. The training texts, drawn from a large corpus of battery research papers, are tokenized with WordPiece before pretraining.
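To make this concrete, the sketch below queries the MLM head directly with BertForMaskedLM and reports the most likely tokens for a masked position; the example sentence is invented and this snippet is not part of the original model card.

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('batterydata/batteryonlybert-cased')
model = BertForMaskedLM.from_pretrained('batterydata/batteryonlybert-cased')

text = "The [MASK] is the negative electrode of a lithium-ion cell."
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and list the five highest-probability tokens for it.
mask_positions = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].softmax(dim=-1).topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))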
License
This model is licensed under the Apache-2.0 license.
Information Table

| Property | Details |
|----------|---------|
| Model Type | Pretrained transformers model for battery research papers, trained with the MLM objective |
| Training Data | Full text of 400,366 battery research papers (2000 to June 2021) from RSC, Elsevier, and Springer, containing 1.87B tokens; the DOI list is available on GitHub |
Authors
- Shu Huang: sh2009 [at] cam.ac.uk
- Jacqueline Cole: jmc61 [at] cam.ac.uk
Citation
BatteryBERT: A Pre-trained Language Model for Battery Database Enhancement