BatteryOnlyBERT-cased model
A model pretrained on a large corpus of battery research papers with a masked language modeling (MLM) objective, learning bidirectional sentence representations for downstream tasks.
Quick Start
The BatteryOnlyBERT-cased model is a transformers model pretrained on a large corpus of battery research papers. You can use it directly for masked language modeling or fine-tune it for downstream tasks.
Here is an example of using it with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='batterydata/batteryonlybert-cased')
>>> unmasker("Hello I'm a [MASK] model.")
Features
- Bidirectional Representation: Pretrained with the masked language modeling (MLM) objective, allowing it to learn a bidirectional representation of the sentence.
- Case-Sensitive: It distinguishes between words that differ only in capitalization, such as "english" and "English".
- Useful for Downstream Tasks: Can be used to extract features for downstream tasks such as sequence classification, token classification, or question answering.
Installation
The model is used through the Hugging Face Transformers library, which can be installed with pip install transformers; the usage examples below additionally require PyTorch or TensorFlow.
Usage Examples
Basic Usage
from transformers import pipeline

# Load a fill-mask pipeline with the BatteryOnlyBERT-cased checkpoint
unmasker = pipeline('fill-mask', model='batterydata/batteryonlybert-cased')
result = unmasker("Hello I'm a [MASK] model.")
print(result)
Advanced Usage
from transformers import BertTokenizer, BertModel

# Extract contextual features with the PyTorch backend
tokenizer = BertTokenizer.from_pretrained('batterydata/batteryonlybert-cased')
model = BertModel.from_pretrained('batterydata/batteryonlybert-cased')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
print(output)
from transformers import BertTokenizer, TFBertModel

# The same feature extraction with the TensorFlow backend
tokenizer = BertTokenizer.from_pretrained('batterydata/batteryonlybert-cased')
model = TFBertModel.from_pretrained('batterydata/batteryonlybert-cased')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
print(output)
Documentation
Model description
BatteryOnlyBERT is a transformers model pretrained on a large corpus of battery research papers in a self-supervised fashion, using the masked language modeling (MLM) objective. The model randomly masks 15% of the words in the input sentence and then predicts the masked words, which allows it to learn a bidirectional representation of the sentence. This inner representation of the English language can then be used to extract features for downstream tasks.
Training data
The BatteryOnlyBERT model was pretrained on the full text of battery papers only. The paper corpus contains 1.87B tokens from a total of 400,366 battery research papers published from 2000 to June 2021 by the Royal Society of Chemistry (RSC), Elsevier, and Springer. The list of DOIs can be found on GitHub.
Training procedure
Preprocessing
The texts are tokenized using WordPiece with a vocabulary size of 28,996. The model inputs are of the form [CLS] Sentence A [SEP] Sentence B [SEP]. The masking procedure for each sentence is as follows (a sketch using the transformers data collator follows this list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by [MASK].
- In 10% of the cases, the masked tokens are replaced by a random token.
- In the remaining 10% of the cases, the masked tokens are left as is.
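This 15%/80%/10%/10% scheme matches the default dynamic masking implemented by DataCollatorForLanguageModeling in the transformers library. The sketch below only illustrates that behaviour on an invented sentence pair; it is not the authors' original preprocessing code.

from transformers import BertTokenizer, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained('batterydata/batteryonlybert-cased')

# 15% of tokens are selected; of those, 80% become [MASK], 10% become a random
# token, and 10% are left unchanged (the collator's default behaviour).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Sentence-pair input in the [CLS] Sentence A [SEP] Sentence B [SEP] format.
encoded = tokenizer("Lithium-ion batteries degrade over time.",
                    "Electrolyte additives can slow this degradation.")
batch = collator([encoded])
print(tokenizer.decode(batch['input_ids'][0]))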
Pretraining
The model was trained on 8 NVIDIA DGX A100 GPUs for 1,500,000 steps with a batch size of 256. The sequence length was limited to 512 tokens. The optimizer used is Adam with a learning rate of 1e-4, \(\beta_{1}=0.9\) and \(\beta_{2}=0.999\), a weight decay of 0.01, learning rate warm-up for 10,000 steps, and linear decay of the learning rate afterwards.
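As an illustration of these hyperparameters only (this is not the authors' training script, and using AdamW for the decoupled weight decay is an assumption), the optimizer and schedule could be configured in transformers as follows:

import torch
from transformers import BertForMaskedLM, get_linear_schedule_with_warmup

model = BertForMaskedLM.from_pretrained('batterydata/batteryonlybert-cased')

# Adam with lr=1e-4, beta1=0.9, beta2=0.999 and weight decay 0.01
# (AdamW is assumed here as the weight-decay variant).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=0.01)

# 10,000 warm-up steps followed by linear decay over 1,500,000 total steps.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=10_000,
                                            num_training_steps=1_500_000)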
Intended uses & limitations
You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task. The model is primarily aimed at tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For text generation tasks, you should look at models like GPT2.
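A minimal fine-tuning sketch for sequence classification is shown below; the two toy sentences, the binary label set, and the learning rate are placeholders for illustration, not a recommended setup.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('batterydata/batteryonlybert-cased')
# num_labels=2 assumes a binary task (e.g. battery-relevant vs. not relevant).
model = BertForSequenceClassification.from_pretrained(
    'batterydata/batteryonlybert-cased', num_labels=2)

# Toy labeled examples; in practice you would fine-tune on a real annotated dataset.
texts = ["The cathode capacity fades after 500 cycles.",
         "The weather was pleasant during the conference."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One training step: a forward pass with labels returns the classification loss.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))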
Evaluation results
Final loss: 1.0614.
Technical Details
The model uses the masked language modeling (MLM) objective to learn bidirectional sentence representations: it randomly masks tokens in the input sentence and predicts them from both the left and the right context, unlike traditional RNNs and autoregressive models such as GPT, which only condition on preceding tokens. The training texts, drawn from a large corpus of battery research papers, are tokenized with WordPiece before pretraining.
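To make this concrete, the sketch below queries the MLM head directly with BertForMaskedLM and reports the most likely tokens for a masked position; the example sentence is invented and this snippet is not part of the original model card.

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('batterydata/batteryonlybert-cased')
model = BertForMaskedLM.from_pretrained('batterydata/batteryonlybert-cased')

text = "The [MASK] is the negative electrode of a lithium-ion cell."
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and list the five highest-probability tokens for it.
mask_positions = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].softmax(dim=-1).topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))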
License
This model is licensed under the Apache-2.0 license.
Information Table

| Property | Details |
|----------|---------|
| Model Type | Pretrained transformers model for battery research papers, trained with the MLM objective |
| Training Data | Full text of 400,366 battery research papers (2000 to June 2021) from RSC, Elsevier, and Springer, containing 1.87B tokens; the DOI list is available on GitHub |
Authors
- Shu Huang: sh2009 [at] cam.ac.uk
- Jacqueline Cole: jmc61 [at] cam.ac.uk
Citation
BatteryBERT: A Pre-trained Language Model for Battery Database Enhancement