Introduction to IBM's Foundation Models for Materials
IBM's large foundation models for sustainable materials support and advance research in materials science and chemistry, spanning various representations and modalities.
GitHub: https://github.com/IBM/materials
Paper: arXiv:2407.20267
Quick Start
Pretrained Models and Training Logs
We provide checkpoints of the SMI-TED model pre-trained on a dataset of ~91M molecules curated from PubChem. The pre-trained model shows competitive performance on classification and regression benchmarks from MoleculeNet.
Add the SMI-TED pre-trained weights (`.pt`) to the `inference/` or `finetune/` directory, according to your needs. The directory structure should look like the following:
```
inference/
├── smi_ted_light
│   ├── smi_ted_light.pt
│   ├── bert_vocab_curated.txt
│   └── load.py
```
and/or:
```
finetune/
├── smi_ted_light
│   ├── smi_ted_light.pt
│   ├── bert_vocab_curated.txt
│   └── load.py
```
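As an optional sanity check (the snippet below is illustrative and not part of the repository), you can confirm that the expected files are in place before running inference or finetuning:

```python
# Illustrative check: confirm the checkpoint, vocabulary, and loader script
# are where the inference/finetune code expects them.
from pathlib import Path

expected = ('smi_ted_light.pt', 'bert_vocab_curated.txt', 'load.py')
for folder in ('inference/smi_ted_light', 'finetune/smi_ted_light'):
    for name in expected:
        status = 'found' if (Path(folder) / name).exists() else 'MISSING'
        print(f'{folder}/{name}: {status}')
```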
Replicating Conda Environment
Follow these steps to replicate our Conda environment and install the necessary libraries:
Create and Activate Conda Environment
```bash
conda create --name smi-ted-env python=3.10
conda activate smi-ted-env
```
Install Packages with Conda
```bash
conda install pytorch=2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
```
Install Packages with Pip
```bash
pip install -r requirements.txt
pip install pytorch-fast-transformers
```
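A quick check such as the one below (illustrative, not part of the repository) confirms that the expected PyTorch and CUDA versions are visible inside the new environment:

```python
# Illustrative environment check for the smi-ted-env Conda environment.
import torch

print('PyTorch version:', torch.__version__)          # expected: 2.1.0
print('CUDA available :', torch.cuda.is_available())  # requires an Nvidia GPU
print('CUDA build     :', torch.version.cuda)         # expected: 11.8
```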
Features
We present a large encoder-decoder chemical foundation model, SMILES-based Transformer Encoder-Decoder (SMI-TED), pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. SMI-TED supports various complex tasks, including quantum property prediction, with two main variants (289M and 8X289M). Our experiments across multiple benchmark datasets demonstrate state-of-the-art performance for various tasks.
Installation
This code and environment have been tested on Nvidia V100s and Nvidia A100s.
Setup follows the steps in the Quick Start section above: add the SMI-TED pre-trained weights to the `inference/` or `finetune/` directory, then replicate the Conda environment and install the required packages.
Usage Examples
Basic Usage
To load SMI-TED:
```python
# load_smi_ted is defined in load.py, shipped alongside the checkpoint
model = load_smi_ted(
    folder='../inference/smi_ted_light',
    ckpt_filename='smi_ted_light.pt'
)
```
To encode SMILES into embeddings:
```python
# df is a DataFrame with a 'SMILES' column
with torch.no_grad():
    encoded_embeddings = model.encode(df['SMILES'], return_torch=True)
```
To decode embeddings to SMILES strings:
```python
with torch.no_grad():
    decoded_smiles = model.decode(encoded_embeddings)
```
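Putting the two calls together, a minimal round-trip check could look like the following sketch; the example molecules are arbitrary, and `model` is the object returned by `load_smi_ted` above:

```python
# Illustrative round-trip: encode a few SMILES and decode the embeddings back.
import pandas as pd
import torch

smiles = pd.Series(['CCO', 'c1ccccc1', 'CC(=O)O'], name='SMILES')
with torch.no_grad():
    embeddings = model.encode(smiles, return_torch=True)
    reconstructed = model.decode(embeddings)

for original, rebuilt in zip(smiles, reconstructed):
    print(original, '->', rebuilt)
```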
Advanced Usage
```python
import torch

# Load weights saved as a raw state dict (e.g., model_weights.bin)
with open('model_weights.bin', 'rb') as f:
    state_dict = torch.load(f)
model.load_state_dict(state_dict)
```
Documentation
Pretraining
For pretraining, we use two strategies: the masked language model method to train the encoder, and an encoder-decoder strategy to refine SMILES reconstruction and improve the generated latent space.
SMI-TED is pre-trained on 91M canonicalized and curated SMILES from PubChem with the following constraints:
- Compounds are filtered to a maximum length of 202 tokens during preprocessing.
- A 95/5/0 split is used for encoder training, with 5% of the data for decoder pretraining.
- A 100/0/0 split is also used to train the encoder and decoder directly, enhancing model performance.
The pretraining code provides examples of data processing and model training on a smaller dataset, requiring 8 A100 GPUs.
To pre-train the two variants of the SMI-TED model, run:
```bash
bash training/run_model_light_training.sh
```
or
```bash
bash training/run_model_large_training.sh
```
Use `train_model_D.py` to train only the decoder, or `train_model_ED.py` to train both the encoder and decoder.
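As a rough illustration of the first strategy, the sketch below shows a generic masked-language-model step on tokenized SMILES. The toy encoder, vocabulary size, special-token IDs, and masking rate are all illustrative assumptions and do not reflect the actual SMI-TED implementation.

```python
# Minimal, hypothetical sketch of a masked-language-model step on SMILES tokens.
import torch
import torch.nn as nn

vocab_size, d_model, mask_id = 1000, 256, 1   # illustrative values only
embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(2, vocab_size, (8, 64))   # a toy batch of tokenized SMILES
mask = torch.rand(tokens.shape) < 0.15           # mask ~15% of the positions
inputs = tokens.masked_fill(mask, mask_id)

logits = lm_head(encoder(embed(inputs)))
# Cross-entropy only on the masked positions; the rest are ignored.
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
```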
Finetuning
The finetuning datasets and environment can be found in the [finetune](https://github.com/IBM/materials/tree/main/smi-ted/finetune) directory. After setting up the environment, you can run a finetuning task with:
```bash
bash finetune/smi_ted_light/esol/run_finetune_esol.sh
```
Finetuning training/checkpointing resources will be available in directories named `checkpoint_<measure_name>`.
Feature Extraction
The example notebook [smi_ted_encoder_decoder_example.ipynb](https://github.com/IBM/materials/blob/main/smi-ted/notebooks/smi_ted_encoder_decoder_example.ipynb) contains code to load checkpoint files and use the pre-trained model for encoder and decoder tasks. It also includes examples of classification and regression tasks.
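As a hedged sketch of that kind of workflow (not the notebook's exact code), frozen SMI-TED embeddings can serve as features for a downstream regressor; the dataset file, column names, and choice of regressor below are illustrative assumptions, and `model` is the pre-trained model loaded as in the Usage Examples section:

```python
# Illustrative: fit a simple regressor on frozen SMI-TED embeddings.
import pandas as pd
import torch
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

df = pd.read_csv('esol.csv')                 # assumed columns: 'SMILES', 'target'
with torch.no_grad():
    X = model.encode(df['SMILES'], return_torch=True).cpu().numpy()
y = df['target'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
regressor = Ridge().fit(X_train, y_train)
print('Held-out R^2:', regressor.score(X_test, y_test))
```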
License
The project is under the Apache-2.0 license.
Citations
```bibtex
@misc{soares2024largeencoderdecoderfamilyfoundation,
      title={A Large Encoder-Decoder Family of Foundation Models For Chemical Language},
      author={Eduardo Soares and Victor Shirasuna and Emilio Vital Brazil and Renato Cerqueira and Dmitry Zubarev and Kristin Schmidt},
      year={2024},
      eprint={2407.20267},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2407.20267},
}
```
Additional Information
This repository provides PyTorch source code associated with our publication, "A Large Encoder-Decoder Family of Foundation Models for Chemical Language".
We provide the model weights in two formats: a PyTorch checkpoint (`.pt`) and safetensors.
For more information contact: eduardo.soares@ibm.com or evital@br.ibm.com.
| Property | Details |
|---|---|
| Model Type | SMILES-based Transformer Encoder-Decoder (SMI-TED) |
| Training Data | 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens |
| Metrics | accuracy |
| Pipeline Tag | feature-extraction |
| Tags | chemistry, foundation models, AI4Science, materials, molecules, safetensors, pytorch, transformer, diffusers |
| Library Name | transformers |