# MoLFormer-XL-both-10%
MoLFormer is a class of models pretrained on SMILES string representations of up to 1.1B molecules from ZINC and PubChem. This repository is for the model pretrained on 10% of both datasets.
## Quick Start
You can use the code below to get started with the model.
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ibm/MoLFormer-XL-both-10pct", deterministic_eval=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ibm/MoLFormer-XL-both-10pct", trust_remote_code=True)

smiles = ["Cn1c(=O)c2c(ncn2C)n(C)c1=O", "CC(=O)Oc1ccccc1C(=O)O"]  # caffeine, aspirin
inputs = tokenizer(smiles, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
outputs.pooler_output  # one fixed-size embedding per molecule
```
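Here `outputs.pooler_output` is a fixed-size embedding for each input molecule (shape `(batch_size, hidden_size)`), which can be used directly as features for downstream models.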
## Features
- Feature Extraction: Can be used as a feature extractor for various molecular analysis tasks.
- Fine-Tuning: Suitable for fine-tuning on different downstream molecular property prediction tasks.
- Masked Language Modeling: Can be applied to masked language modeling tasks.
## Installation
The card does not list model-specific installation steps; the libraries used in the examples can be installed with:

```bash
pip install torch transformers
```
## Usage Examples

### Basic Usage
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ibm/MoLFormer-XL-both-10pct", deterministic_eval=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ibm/MoLFormer-XL-both-10pct", trust_remote_code=True)

smiles = ["Cn1c(=O)c2c(ncn2C)n(C)c1=O", "CC(=O)Oc1ccccc1C(=O)O"]  # caffeine, aspirin
inputs = tokenizer(smiles, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
outputs.pooler_output  # one fixed-size embedding per molecule
```
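As noted under intended use below, the frozen embeddings can be used for similarity measurements. A minimal sketch, assuming cosine similarity as the (illustrative) comparison metric:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ibm/MoLFormer-XL-both-10pct", deterministic_eval=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ibm/MoLFormer-XL-both-10pct", trust_remote_code=True)

smiles = ["Cn1c(=O)c2c(ncn2C)n(C)c1=O", "CC(=O)Oc1ccccc1C(=O)O"]  # caffeine, aspirin
inputs = tokenizer(smiles, padding=True, return_tensors="pt")
with torch.no_grad():
    emb = model(**inputs).pooler_output  # (2, hidden_size)

# Cosine similarity between the two frozen molecule embeddings
sim = F.cosine_similarity(emb[0], emb[1], dim=0)
print(f"cosine similarity: {sim.item():.3f}")
```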
## Documentation

### Model Details

#### Model Description
MoLFormer is a large-scale chemical language model trained on small molecules represented as SMILES strings. It leverages masked language modeling and employs a linear attention Transformer combined with rotary embeddings.

The transformer-based neural network model is trained on a large collection of chemical molecules from the ZINC and PubChem datasets in a self-supervised fashion. It uses an efficient linear attention mechanism and relative positional (rotary) embeddings to learn a meaningful, compressed representation of chemical molecules. After pretraining, the MoLFormer foundation model can be fine-tuned for different downstream molecular property prediction tasks. To test its representational power, the MoLFormer encodings were used to recover molecular similarity, and the correspondence between interatomic spatial distance and attention value was analyzed for given molecules.
### Intended Use and Limitations
You can use the model for masked language modeling, but it is mainly intended as a feature extractor or for fine-tuning on a prediction task. The "frozen" model embeddings can be used for similarity measurements, visualization, or training predictor models. The model can also be fine-tuned for sequence classification tasks (e.g., solubility, toxicity, etc.), as in the sketch below.
This model is not intended for molecule generation. It has not been tested on molecules larger than ~200 atoms (i.e., macromolecules). Using invalid or noncanonical SMILES may degrade performance.
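One simple recipe consistent with the "frozen embeddings" use case is to train a small classifier head on top of `pooler_output`. A minimal sketch; the linear probe, the hidden size, and the toy labels are illustrative assumptions, not the fine-tuning setup from the paper:

```python
import torch
import torch.nn as nn

# Placeholders: `embeddings` stands in for frozen MoLFormer pooler outputs,
# `labels` for binary property labels (e.g., toxic / non-toxic).
hidden_size = 768  # illustrative; read the true value from model.config.hidden_size
embeddings = torch.randn(32, hidden_size)
labels = torch.randint(0, 2, (32,)).float()

head = nn.Linear(hidden_size, 1)  # simple probe on top of frozen features
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(10):
    opt.zero_grad()
    logits = head(embeddings).squeeze(-1)
    loss = loss_fn(logits, labels)
    loss.backward()
    opt.step()
```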
### Training Details

#### Data
We trained MoLFormer-XL on a combination of molecules from the ZINC15 and PubChem datasets. This repository contains the version trained on 10% ZINC + 10% PubChem.
Molecules were canonicalized with RDKit prior to training, and isomeric information was removed. Molecules longer than 202 tokens were dropped.
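A preprocessing step along these lines can be reproduced with RDKit; the exact flags used for training are not given in the card, but `isomericSmiles=False` drops stereochemistry, matching the note above:

```python
from typing import Optional
from rdkit import Chem

def canonicalize(smiles: str) -> Optional[str]:
    """Return canonical SMILES with isomeric (stereo/isotope) information removed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # invalid SMILES
        return None
    return Chem.MolToSmiles(mol, canonical=True, isomericSmiles=False)

print(canonicalize("C[C@H](N)C(=O)O"))  # stereocenter dropped, e.g. "CC(N)C(=O)O"
```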
### Evaluation
We evaluated MoLFormer by fine-tuning on 11 benchmark tasks from MoleculeNet. The tables below show the performance of different MoLFormer variants:
| Model variant | BBBP | HIV | BACE | SIDER | ClinTox | Tox21 |
|---|---|---|---|---|---|---|
| 10% ZINC + 10% PubChem | 91.5 | 81.3 | 86.6 | 68.9 | 94.6 | 84.5 |
| 10% ZINC + 100% PubChem | 92.2 | 79.2 | 86.3 | 69.0 | 94.7 | 84.5 |
| 100% ZINC | 89.9 | 78.4 | 87.7 | 66.8 | 82.2 | 83.2 |
| MoLFormer-Base | 90.9 | 77.7 | 82.8 | 64.8 | 61.3 | 43.1 |
| MoLFormer-XL | 93.7 | 82.2 | 88.2 | 69.0 | 94.8 | 84.7 |
| Model variant | QM9 | QM8 | ESOL | FreeSolv | Lipophilicity |
|---|---|---|---|---|---|
| 10% ZINC + 10% PubChem | 1.7754 | 0.0108 | 0.3295 | 0.2221 | 0.5472 |
| 10% ZINC + 100% PubChem | 1.9093 | 0.0102 | 0.2775 | 0.2050 | 0.5331 |
| 100% ZINC | 1.9403 | 0.0124 | 0.3023 | 0.2981 | 0.5440 |
| MoLFormer-Base | 2.2500 | 0.0111 | 0.2798 | 0.2596 | 0.6492 |
| MoLFormer-XL | 1.5984 | 0.0102 | 0.2787 | 0.2308 | 0.5298 |
We report AUROC for all classification tasks, average MAE for QM9/QM8, and RMSE for the remaining regression tasks.
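For reference, these metrics correspond to standard scikit-learn calls; a sketch with toy arrays, not the paper's evaluation code:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, mean_absolute_error, mean_squared_error

y_true_cls, y_score = [0, 1, 1, 0], [0.1, 0.8, 0.7, 0.3]  # toy classification data
y_true_reg, y_pred = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]     # toy regression data

print("AUROC:", roc_auc_score(y_true_cls, y_score))               # classification tasks
print("MAE:  ", mean_absolute_error(y_true_reg, y_pred))          # QM9/QM8 (averaged over targets)
print("RMSE: ", np.sqrt(mean_squared_error(y_true_reg, y_pred)))  # other regression tasks
```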
## License
This project is licensed under the Apache-2.0 license.
## Technical Details
The model uses masked language modeling and a linear attention Transformer combined with rotary embeddings. It is trained on a large collection of chemical molecules from the ZINC and PubChem datasets in a self - supervised fashion. The linear attention mechanism and relative positional embeddings help it learn a meaningful and compressed representation of chemical molecules.
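The actual implementation lives in the model repository; the following is only a conceptual sketch of linear attention combined with rotary embeddings, with an assumed feature map (`elu(x) + 1`, as in Katharopoulos et al.) and illustrative shapes:

```python
import torch
import torch.nn.functional as F

def apply_rotary(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of q/k by position-dependent angles (rotary embeddings)."""
    b, h, n, d = x.shape  # (batch, heads, seq_len, head_dim), head_dim even
    pos = torch.arange(n, dtype=x.dtype, device=x.device)
    freqs = base ** (-torch.arange(0, d, 2, dtype=x.dtype, device=x.device) / d)
    angles = torch.outer(pos, freqs)  # (seq_len, head_dim / 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def linear_attention(q, k, v, eps=1e-6):
    """O(n) attention: a positive feature map phi(x) = elu(x) + 1 replaces softmax."""
    q, k = apply_rotary(q), apply_rotary(k)
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bhnd,bhne->bhde", k, v)  # sum_n phi(k_n) v_n^T, built once
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)  # normalizer
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

q = k = v = torch.randn(1, 12, 16, 64)  # toy shapes: 12 heads, seq 16, head_dim 64
out = linear_attention(q, k, v)         # (1, 12, 16, 64)
```

Because the key-value summary `kv` has a fixed size independent of sequence length, cost grows linearly in the number of tokens rather than quadratically as in softmax attention.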
## Citation
```bibtex
@article{10.1038/s42256-022-00580-7,
  title   = {{Large-scale chemical language representations capture molecular structure and properties}},
  author  = {Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel},
  journal = {Nature Machine Intelligence},
  year    = {2022},
  volume  = {4},
  number  = {12},
  pages   = {1256--1264},
  doi     = {10.1038/s42256-022-00580-7}
}

@misc{https://doi.org/10.48550/arxiv.2106.09553,
  title     = {Large-Scale Chemical Language Representations Capture Molecular Structure and Properties},
  author    = {Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel},
  publisher = {arXiv},
  year      = {2021},
  doi       = {10.48550/ARXIV.2106.09553},
  url       = {https://arxiv.org/abs/2106.09553},
  keywords  = {Machine Learning (cs.LG), Computation and Language (cs.CL), Biomolecules (q-bio.BM), FOS: Computer and information sciences, FOS: Biological sciences},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```