MaterialsBERT
MaterialsBERT is a fine-tuned language model for materials science. It was created by fine-tuning PubMedBERT on a large corpus of materials science abstracts, which improves performance on materials-science NLP tasks and makes it a useful tool for researchers and practitioners in the field.
Quick Start
Here is how to load this model and run it on a given text in PyTorch:
from transformers import BertForMaskedLM, BertTokenizer

# Load the tokenizer and the model (with its masked-language-modeling head)
tokenizer = BertTokenizer.from_pretrained('pranav-s/MaterialsBERT')
model = BertForMaskedLM.from_pretrained('pranav-s/MaterialsBERT')

# Tokenize the input and run a forward pass
text = "Enter any text you like"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)  # output.logits holds per-token vocabulary scores
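If you instead want contextual embeddings (features) rather than masked-token logits, one option is to request the encoder's hidden states and pool them. A minimal sketch, assuming mean pooling of the last layer (the example sentence is illustrative, not from the card):

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pranav-s/MaterialsBERT')
model = BertForMaskedLM.from_pretrained('pranav-s/MaterialsBERT')

text = "Polyethylene is a semicrystalline polymer."
encoded_input = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    # output_hidden_states=True exposes every encoder layer's activations
    output = model(**encoded_input, output_hidden_states=True)
# Mean-pool the final hidden layer over tokens to get one vector per text
embedding = output.hidden_states[-1].mean(dim=1)  # shape: (1, 768)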
Features
- Domain-specific fine-tuning: fine-tuned on a large corpus of materials science abstracts, which improves performance on downstream NLP tasks in materials science.
- Based on PubMedBERT: chosen because the biomedical domain is close to the materials science domain, so the model leverages PubMedBERT's pre-trained knowledge.
- Versatile downstream applications: suitable for downstream tasks such as sequence classification, token classification, or question answering in materials science (see the sketch after this list).
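As an example of the token-classification case, here is a minimal sketch of attaching a fresh classification head to MaterialsBERT (the label count of 3 is a hypothetical placeholder, not from this card):

from transformers import BertForTokenClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pranav-s/MaterialsBERT')
# num_labels is task-specific; 3 is an illustrative placeholder
model = BertForTokenClassification.from_pretrained('pranav-s/MaterialsBERT', num_labels=3)
# The new head is randomly initialized and must be fine-tuned on labeled data
# before the model produces meaningful predictions.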
Documentation
Model description
Domain-specific fine-tuning has been shown to improve downstream performance on a variety of NLP tasks. MaterialsBERT fine-tunes PubMedBERT, a language model pre-trained on biomedical literature; this starting point was chosen because the biomedical domain is close to the materials science domain. When further fine-tuned on a variety of downstream sequence-labeling tasks in materials science, MaterialsBERT outperformed the other baseline language models tested on three out of five datasets.
Intended uses & limitations
You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on materials-science-relevant downstream tasks.
Note that this model is primarily aimed at being fine-tuned on tasks that use a sentence or a paragraph (potentially masked) to make decisions, such as sequence classification, token classification, or question answering.
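For quick experimentation with the raw masked-language-modeling objective, the fill-mask pipeline works out of the box. A minimal sketch (the example sentence is ours, not from the card):

from transformers import pipeline

fill_mask = pipeline('fill-mask', model='pranav-s/MaterialsBERT')
# Rank candidate tokens for the [MASK] position
for prediction in fill_mask("Polystyrene is a [MASK] polymer."):
    print(prediction['token_str'], round(prediction['score'], 3))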
Training data
A fine-tuning corpus of 2.4 million materials science abstracts was used. The DOIs of the journal articles in the corpus are provided in the file training_DOI.txt.
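To inspect the corpus provenance programmatically, the DOI list can be read directly. A minimal sketch, assuming training_DOI.txt contains one DOI per line:

# Assumes one DOI per line; verify against the actual file in the repository
with open('training_DOI.txt') as f:
    dois = [line.strip() for line in f if line.strip()]
print(f"{len(dois)} journal article DOIs in the fine-tuning corpus")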
Training procedure
Training hyperparameters
The following hyperparameters were used during training (a configuration sketch follows the list):
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0
- mixed_precision_training: Native AMP
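In Hugging Face terms, these settings correspond roughly to the TrainingArguments below. This is a hedged sketch: output_dir is a placeholder, and the per-device batch sizes assume a single device, which the card does not state.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./materialsbert-mlm',   # placeholder path
    learning_rate=5e-05,
    per_device_train_batch_size=32,     # assumes a single device
    per_device_eval_batch_size=32,
    seed=42,
    lr_scheduler_type='linear',
    num_train_epochs=3.0,
    fp16=True,                          # Native AMP mixed precision
)
# Adam with betas=(0.9, 0.999) and epsilon=1e-08 matches the Transformers
# default optimizer settings (AdamW).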
Framework versions
- Transformers 4.17.0
- PyTorch 1.10.2
- Datasets 1.18.3
- Tokenizers 0.11.0
Technical Details
The model is a fine-tuned version of the [PubMedBERT model](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext), trained on a dataset of 2.4 million materials science abstracts. It was introduced in [this paper](https://www.nature.com/articles/s41524-023-01003-w). This model is uncased.
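Since the model is uncased, its tokenizer lowercases text before subword splitting, so differently cased mentions map to the same tokens. A quick check (a minimal sketch):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pranav-s/MaterialsBERT')
# Uncased: both spellings produce identical token sequences
print(tokenizer.tokenize('PMMA'))
print(tokenizer.tokenize('pmma'))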
License
The license of this model is listed as "other" (a custom, non-standard license); consult the model repository for the exact terms.
Citation
If you find MaterialsBERT useful in your research, please cite the following paper:
@article{materialsbert,
  title={A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing},
  author={Shetty, Pranav and Rajan, Arunkumar Chitteth and Kuenneth, Chris and Gupta, Sonakshi and Panchumarti, Lakshmi Prerana and Holm, Lauren and Zhang, Chao and Ramprasad, Rampi},
  journal={npj Computational Materials},
  volume={9},
  number={1},
  pages={52},
  year={2023},
  publisher={Nature Publishing Group UK London}
}