MaterialsBERT
MaterialsBERT is a fine-tuned language model for materials science. It was created by fine-tuning PubMedBERT on a large corpus of materials science abstracts, which improves performance on materials-science NLP tasks and makes it a useful tool for researchers and practitioners in the field.
Quick Start
Here is how to load this model and run it on a given text in PyTorch:
from transformers import BertForMaskedLM, BertTokenizer

# Load the tokenizer and the model (with its masked-language-modeling head)
tokenizer = BertTokenizer.from_pretrained('pranav-s/MaterialsBERT')
model = BertForMaskedLM.from_pretrained('pranav-s/MaterialsBERT')

# Tokenize the input and run a forward pass
text = "Enter any text you like"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)  # output.logits holds per-token vocabulary scores
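If you instead want contextual embeddings (features) rather than masked-token logits, one option is to request the encoder's hidden states and pool them. A minimal sketch, assuming mean pooling of the last layer (the example sentence is illustrative, not from the card):

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pranav-s/MaterialsBERT')
model = BertForMaskedLM.from_pretrained('pranav-s/MaterialsBERT')

text = "Polyethylene is a semicrystalline polymer."
encoded_input = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    # output_hidden_states=True exposes every encoder layer's activations
    output = model(**encoded_input, output_hidden_states=True)
# Mean-pool the final hidden layer over tokens to get one vector per text
embedding = output.hidden_states[-1].mean(dim=1)  # shape: (1, 768)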
Features
- Domain-specific fine-tuning: fine-tuned on a large corpus of materials science abstracts, which improves performance on downstream NLP tasks in materials science.
- Based on PubMedBERT: chosen because the biomedical domain is close to the materials science domain, so the model leverages PubMedBERT's pre-trained knowledge.
- Versatile downstream applications: suitable for downstream tasks such as sequence classification, token classification, or question answering in materials science (see the sketch after this list).
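As an example of the token-classification case, here is a minimal sketch of attaching a fresh classification head to MaterialsBERT (the label count of 3 is a hypothetical placeholder, not from this card):

from transformers import BertForTokenClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pranav-s/MaterialsBERT')
# num_labels is task-specific; 3 is an illustrative placeholder
model = BertForTokenClassification.from_pretrained('pranav-s/MaterialsBERT', num_labels=3)
# The new head is randomly initialized and must be fine-tuned on labeled data
# before the model produces meaningful predictions.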
Documentation
Model description
Domain-specific fine-tuning has been shown to improve downstream performance on a variety of NLP tasks. MaterialsBERT fine-tunes PubMedBERT, a language model pre-trained on biomedical literature; this starting point was chosen because the biomedical domain is close to the materials science domain. When further fine-tuned on a variety of downstream sequence-labeling tasks in materials science, MaterialsBERT outperformed the other baseline language models tested on three out of five datasets.
Intended uses & limitations
You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on materials-science-relevant downstream tasks.
Note that this model is primarily aimed at being fine-tuned on tasks that use a sentence or a paragraph (potentially masked) to make decisions, such as sequence classification, token classification, or question answering.
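For quick experimentation with the raw masked-language-modeling objective, the fill-mask pipeline works out of the box. A minimal sketch (the example sentence is ours, not from the card):

from transformers import pipeline

fill_mask = pipeline('fill-mask', model='pranav-s/MaterialsBERT')
# Rank candidate tokens for the [MASK] position
for prediction in fill_mask("Polystyrene is a [MASK] polymer."):
    print(prediction['token_str'], round(prediction['score'], 3))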
Training data
A fine-tuning corpus of 2.4 million materials science abstracts was used. The DOIs of the journal articles in the corpus are provided in the file training_DOI.txt.
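To inspect the corpus provenance programmatically, the DOI list can be read directly. A minimal sketch, assuming training_DOI.txt contains one DOI per line:

# Assumes one DOI per line; verify against the actual file in the repository
with open('training_DOI.txt') as f:
    dois = [line.strip() for line in f if line.strip()]
print(f"{len(dois)} journal article DOIs in the fine-tuning corpus")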
Training procedure
Training hyperparameters
The following hyperparameters were used during training (a configuration sketch follows the list):
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0
- mixed_precision_training: Native AMP
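In Hugging Face terms, these settings correspond roughly to the TrainingArguments below. This is a hedged sketch: output_dir is a placeholder, and the per-device batch sizes assume a single device, which the card does not state.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./materialsbert-mlm',   # placeholder path
    learning_rate=5e-05,
    per_device_train_batch_size=32,     # assumes a single device
    per_device_eval_batch_size=32,
    seed=42,
    lr_scheduler_type='linear',
    num_train_epochs=3.0,
    fp16=True,                          # Native AMP mixed precision
)
# Adam with betas=(0.9, 0.999) and epsilon=1e-08 matches the Transformers
# default optimizer settings (AdamW).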
Framework versions
- Transformers 4.17.0
- PyTorch 1.10.2
- Datasets 1.18.3
- Tokenizers 0.11.0
Technical Details
The model is a fine-tuned version of the [PubMedBERT model](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext), trained on a dataset of 2.4 million materials science abstracts. It was introduced in [this paper](https://www.nature.com/articles/s41524-023-01003-w). This model is uncased.
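Since the model is uncased, its tokenizer lowercases text before subword splitting, so differently cased mentions map to the same tokens. A quick check (a minimal sketch):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pranav-s/MaterialsBERT')
# Uncased: both spellings produce identical token sequences
print(tokenizer.tokenize('PMMA'))
print(tokenizer.tokenize('pmma'))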
License
The license of this model is listed as "other" (a custom, non-standard license); consult the model repository for the exact terms.
Citation
If you find MaterialsBERT useful in your research, please cite the following paper:
@article{materialsbert,
  title={A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing},
  author={Shetty, Pranav and Rajan, Arunkumar Chitteth and Kuenneth, Chris and Gupta, Sonakshi and Panchumarti, Lakshmi Prerana and Holm, Lauren and Zhang, Chao and Ramprasad, Rampi},
  journal={npj Computational Materials},
  volume={9},
  number={1},
  pages={52},
  year={2023},
  publisher={Nature Publishing Group UK London}
}