Introduction to IBM's Foundation Models for Materials
IBM's large foundation models for sustainable materials support and advance research in materials science and chemistry, spanning various representations and modalities.
GitHub: https://github.com/IBM/materials
Paper: arXiv:2407.20267
Quick Start
Pretrained Models and Training Logs
We provide checkpoints of the SMI-TED model pre-trained on a dataset of ~91M molecules curated from PubChem. The pre-trained model shows competitive performance on classification and regression benchmarks from MoleculeNet.
Add the SMI-TED pre-trained weights (`.pt`) to the `inference/` or `finetune/` directory, according to your needs. The directory structure should look like the following:
```
inference/
├── smi_ted_light
│   ├── smi_ted_light.pt
│   ├── bert_vocab_curated.txt
│   └── load.py
```
and/or:
```
finetune/
├── smi_ted_light
│   ├── smi_ted_light.pt
│   ├── bert_vocab_curated.txt
│   └── load.py
```
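As an optional sanity check (the snippet below is illustrative and not part of the repository), you can confirm that the expected files are in place before running inference or finetuning:

```python
# Illustrative check: confirm the checkpoint, vocabulary, and loader script
# are where the inference/finetune code expects them.
from pathlib import Path

expected = ('smi_ted_light.pt', 'bert_vocab_curated.txt', 'load.py')
for folder in ('inference/smi_ted_light', 'finetune/smi_ted_light'):
    for name in expected:
        status = 'found' if (Path(folder) / name).exists() else 'MISSING'
        print(f'{folder}/{name}: {status}')
```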
Replicating Conda Environment
Follow these steps to replicate our Conda environment and install the necessary libraries:
Create and Activate Conda Environment
```bash
conda create --name smi-ted-env python=3.10
conda activate smi-ted-env
```
Install Packages with Conda
```bash
conda install pytorch=2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
```
Install Packages with Pip
```bash
pip install -r requirements.txt
pip install pytorch-fast-transformers
```
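A quick check such as the one below (illustrative, not part of the repository) confirms that the expected PyTorch and CUDA versions are visible inside the new environment:

```python
# Illustrative environment check for the smi-ted-env Conda environment.
import torch

print('PyTorch version:', torch.__version__)          # expected: 2.1.0
print('CUDA available :', torch.cuda.is_available())  # requires an Nvidia GPU
print('CUDA build     :', torch.version.cuda)         # expected: 11.8
```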
Features
We present a large encoder-decoder chemical foundation model, SMILES-based Transformer Encoder-Decoder (SMI-TED), pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. SMI-TED supports various complex tasks, including quantum property prediction, with two main variants (289M and 8X289M). Our experiments across multiple benchmark datasets demonstrate state-of-the-art performance for various tasks.
Installation
This code and environment have been tested on Nvidia V100s and Nvidia A100s.
Setup follows the steps in the Quick Start section above: add the SMI-TED pre-trained weights to the `inference/` or `finetune/` directory, then replicate the Conda environment and install the required packages.
Usage Examples
Basic Usage
To load SMI-TED:
```python
# load_smi_ted is defined in load.py, shipped alongside the checkpoint
model = load_smi_ted(
    folder='../inference/smi_ted_light',
    ckpt_filename='smi_ted_light.pt'
)
```
To encode SMILES into embeddings:
```python
# df is a DataFrame with a 'SMILES' column
with torch.no_grad():
    encoded_embeddings = model.encode(df['SMILES'], return_torch=True)
```
To decode embeddings to SMILES strings:
```python
with torch.no_grad():
    decoded_smiles = model.decode(encoded_embeddings)
```
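Putting the two calls together, a minimal round-trip check could look like the following sketch; the example molecules are arbitrary, and `model` is the object returned by `load_smi_ted` above:

```python
# Illustrative round-trip: encode a few SMILES and decode the embeddings back.
import pandas as pd
import torch

smiles = pd.Series(['CCO', 'c1ccccc1', 'CC(=O)O'], name='SMILES')
with torch.no_grad():
    embeddings = model.encode(smiles, return_torch=True)
    reconstructed = model.decode(embeddings)

for original, rebuilt in zip(smiles, reconstructed):
    print(original, '->', rebuilt)
```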
Advanced Usage
```python
import torch

# Load weights saved as a raw state dict (e.g., model_weights.bin)
with open('model_weights.bin', 'rb') as f:
    state_dict = torch.load(f)
model.load_state_dict(state_dict)
```
Documentation
Pretraining
For pretraining, we use two strategies: the masked language model method to train the encoder, and an encoder-decoder strategy to refine SMILES reconstruction and improve the generated latent space.
SMI-TED is pre-trained on 91M canonicalized and curated SMILES from PubChem with the following constraints:
- Compounds are filtered to a maximum length of 202 tokens during preprocessing.
- A 95/5/0 split is used for encoder training, with 5% of the data for decoder pretraining.
- A 100/0/0 split is also used to train the encoder and decoder directly, enhancing model performance.
The pretraining code provides examples of data processing and model training on a smaller dataset, requiring 8 A100 GPUs.
To pre-train the two variants of the SMI-TED model, run:
```bash
bash training/run_model_light_training.sh
```
or
```bash
bash training/run_model_large_training.sh
```
Use `train_model_D.py` to train only the decoder, or `train_model_ED.py` to train both the encoder and decoder.
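As a rough illustration of the first strategy, the sketch below shows a generic masked-language-model step on tokenized SMILES. The toy encoder, vocabulary size, special-token IDs, and masking rate are all illustrative assumptions and do not reflect the actual SMI-TED implementation.

```python
# Minimal, hypothetical sketch of a masked-language-model step on SMILES tokens.
import torch
import torch.nn as nn

vocab_size, d_model, mask_id = 1000, 256, 1   # illustrative values only
embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(2, vocab_size, (8, 64))   # a toy batch of tokenized SMILES
mask = torch.rand(tokens.shape) < 0.15           # mask ~15% of the positions
inputs = tokens.masked_fill(mask, mask_id)

logits = lm_head(encoder(embed(inputs)))
# Cross-entropy only on the masked positions; the rest are ignored.
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
```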
Finetuning
The finetuning datasets and environment can be found in the [finetune](https://github.com/IBM/materials/tree/main/smi-ted/finetune) directory. After setting up the environment, you can run a finetuning task with:
```bash
bash finetune/smi_ted_light/esol/run_finetune_esol.sh
```
Finetuning training/checkpointing resources will be available in directories named `checkpoint_<measure_name>`.
Feature Extraction
The example notebook [smi_ted_encoder_decoder_example.ipynb](https://github.com/IBM/materials/blob/main/smi-ted/notebooks/smi_ted_encoder_decoder_example.ipynb) contains code to load checkpoint files and use the pre-trained model for encoder and decoder tasks. It also includes examples of classification and regression tasks.
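As a hedged sketch of that kind of workflow (not the notebook's exact code), frozen SMI-TED embeddings can serve as features for a downstream regressor; the dataset file, column names, and choice of regressor below are illustrative assumptions, and `model` is the pre-trained model loaded as in the Usage Examples section:

```python
# Illustrative: fit a simple regressor on frozen SMI-TED embeddings.
import pandas as pd
import torch
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

df = pd.read_csv('esol.csv')                 # assumed columns: 'SMILES', 'target'
with torch.no_grad():
    X = model.encode(df['SMILES'], return_torch=True).cpu().numpy()
y = df['target'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
regressor = Ridge().fit(X_train, y_train)
print('Held-out R^2:', regressor.score(X_test, y_test))
```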
License
The project is under the Apache-2.0 license.
Citations
```bibtex
@misc{soares2024largeencoderdecoderfamilyfoundation,
      title={A Large Encoder-Decoder Family of Foundation Models For Chemical Language},
      author={Eduardo Soares and Victor Shirasuna and Emilio Vital Brazil and Renato Cerqueira and Dmitry Zubarev and Kristin Schmidt},
      year={2024},
      eprint={2407.20267},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2407.20267},
}
```
Additional Information
This repository provides PyTorch source code associated with our publication, "A Large Encoder-Decoder Family of Foundation Models for Chemical Language".
We provide the model weights in two formats: a PyTorch checkpoint (`.pt`) and safetensors.
For more information contact: eduardo.soares@ibm.com or evital@br.ibm.com.
| Property | Details |
|---|---|
| Model Type | SMILES-based Transformer Encoder-Decoder (SMI-TED) |
| Training Data | 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens |
| Metrics | accuracy |
| Pipeline Tag | feature-extraction |
| Tags | chemistry, foundation models, AI4Science, materials, molecules, safetensors, pytorch, transformer, diffusers |
| Library Name | transformers |