🚀 GP-MoLFormer-Uniq
GP-MoLFormer is a class of models pretrained on SMILES string representations of 0.65-1.1B molecules from ZINC and PubChem. This repository focuses on the model pretrained on all the unique molecules from both datasets.
🚀 Quick Start
Use the code below to get started with the model.
Basic Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pretrained model (the custom model code requires trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("ibm-research/GP-MoLFormer-Uniq", trust_remote_code=True)
# GP-MoLFormer reuses the MoLFormer-XL tokenizer
tokenizer = AutoTokenizer.from_pretrained("ibm-research/MoLFormer-XL-both-10pct", trust_remote_code=True)

# Sample 3 SMILES strings unconditionally (no prompt, so this is de novo generation)
outputs = model.generate(do_sample=True, top_k=None, max_length=202, num_return_sequences=3)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
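Prompted Generation (Scaffold Completion)
The model can also be prompted with a partial SMILES string to complete or decorate a scaffold (see Intended use and limitations below). The snippet below is a minimal sketch using the generic transformers generation API: the prompt string is only illustrative, and the end-of-sequence stripping step is an assumption that may or may not be needed depending on how the tokenizer encodes prompts. See the GitHub repository for the authors' exact procedure.
# Sketch: prompted generation from a partial SMILES (illustrative, not the official recipe)
prompt = "c1ccccc1"  # hypothetical starting fragment (a benzene ring)
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
# If the tokenizer appends an end-of-sequence token, drop it so generation continues the prompt
if tokenizer.eos_token_id is not None and input_ids[0, -1].item() == tokenizer.eos_token_id:
    input_ids = input_ids[:, :-1]
outputs = model.generate(input_ids, do_sample=True, top_k=None, max_length=202, num_return_sequences=3)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))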
✨ Features
- Large-scale autoregressive chemical language model for molecule generation, with the same architecture as MoLFormer-XL (linear attention and rotary position embeddings) but decoder-only Transformer blocks trained with a causal language modeling objective.
- Trained on up to 1.1B molecules in SMILES representation.
- Out-of-the-box unconditional, de novo molecule generation.
- Scaffold completion/decoration by prompting with a partial SMILES string (see the prompted-generation sketch above).
- Fine-tunable on a particular dataset to change the output distribution (e.g., more druglike), or tunable for molecular optimization using pair-tuning.
📦 Installation
The model and tokenizer load directly through the Hugging Face transformers library (with trust_remote_code=True), so the only requirements are transformers and torch; see the Quick Start above.
📚 Documentation
Model Details
Model Description
GP-MoLFormer is a large-scale autoregressive chemical language model intended for molecule generation tasks. GP-MoLFormer employs the same architecture as MoLFormer-XL, including linear attention and rotary position embeddings, but uses decoder-only Transformer blocks trained with a causal language modeling objective. It is trained on up to 1.1B molecules in SMILES representation.
GP-MoLFormer was evaluated on de novo generation (at scale), scaffold-constrained decoration, and molecular property optimization tasks.
Intended use and limitations
The pretrained model may be used out-of-the-box for unconditional, de novo molecule generation. It can also be prompted with a partial SMILES string to do scaffold completion/decoration. We also demonstrate it can be fine-tuned on a particular dataset to change the output distribution (e.g., more druglike) or tuned for molecular optimization using pair-tuning. For details, see the paper and GitHub repository.
This model is not tested for classification performance. It is also not tested for molecules larger than ~200 atoms (i.e., macromolecules). Furthermore, using invalid or noncanonical SMILES may result in worse performance.
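As context for the fine-tuning use case, the sketch below shows a generic Hugging Face causal-language-modeling fine-tuning loop on a small, hypothetical SMILES dataset. This is our assumption of a standard setup, not the authors' recipe (pair-tuning and the official fine-tuning code live in the GitHub repository); it also assumes the tokenizer defines a pad token and that the remote model code returns a language-modeling loss when labels are provided.
# Sketch: generic causal-LM fine-tuning on custom SMILES (not the authors' pair-tuning procedure)
from datasets import Dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Toy dataset; replace with your own SMILES (ideally canonical, non-isomeric, <= 202 tokens)
train_smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]
dataset = Dataset.from_dict({"smiles": train_smiles})

def tokenize(batch):
    return tokenizer(batch["smiles"], truncation=True, max_length=202)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["smiles"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective (labels = shifted inputs)
args = TrainingArguments(output_dir="gp-molformer-finetuned", per_device_train_batch_size=8, num_train_epochs=1)
trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()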
Training Details
Data
We trained GP-MoLFormer on a combination of molecules from the ZINC15 and PubChem datasets. This repository contains the version trained on all unique molecules from both datasets.
Molecules were canonicalized with RDKit prior to training and isomeric information was removed. Also, molecules longer than 202 tokens were dropped.
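For illustration, the snippet below shows one way to reproduce this kind of preprocessing with RDKit (canonical SMILES with isomeric information removed). It is a sketch of our own, not the authors' exact pipeline.
# Sketch: RDKit canonicalization with stereochemistry/isomeric information removed
from rdkit import Chem

def canonicalize(smiles):
    """Return the canonical SMILES with stereochemistry removed, or None if the input is invalid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # invalid SMILES
    return Chem.MolToSmiles(mol, isomericSmiles=False, canonical=True)

print(canonicalize("C[C@H](N)C(=O)O"))  # the stereocenter annotation is dropped in the output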
Hardware
- 16 x NVIDIA A100 80GB GPUs
Evaluation
We evaluated GP-MoLFormer on various generation metrics. The first table below summarizes the model; the second shows the performance of GP-MoLFormer-Uniq compared to baseline models:
| Property | Details |
|----------|---------|
| Model Type | Large-scale autoregressive chemical language model |
| Training Data | Combination of molecules from ZINC15 and PubChem datasets (all unique molecules) |

| Model | Val↑ | Uniq@10k↑ | Nov↑ | Frag↑ | Scaf↑ | SNN↑ | IntDiv↑ | FCD↓ |
|-------|------|-----------|------|-------|-------|------|---------|------|
| CharRNN | 0.975 | 0.999 | 0.842 | 0.9998 | 0.9242 | 0.6015 | 0.8562 | 0.0732 |
| VAE | 0.977 | 0.998 | 0.695 | 0.9984 | 0.9386 | 0.6257 | 0.8558 | 0.0990 |
| JT-VAE | 1.000 | 1.000 | 0.914 | 0.9965 | 0.8964 | 0.5477 | 0.8551 | 0.3954 |
| LIMO | 1.000 | 0.976 | 1.000 | 0.6989 | 0.0079 | 0.2464 | 0.9039 | 26.78 |
| MolGen-7B | 1.000 | 1.000 | 0.934 | 0.9999 | 0.6538 | 0.5138 | 0.8617 | 0.0435 |
| GP-MoLFormer-Uniq | 1.000 | 0.977 | 0.390 | 0.9998 | 0.7383 | 0.5045 | 0.8655 | 0.0591 |
We report all metrics using the typical MOSES definitions on each model's respective test set. Note: novelty is with respect to each model's respective training set.
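For reference, the sketch below shows how such metrics can be computed with the MOSES package (molsets) on a list of generated SMILES. This is our illustration, not the evaluation code behind the numbers above, and the exact arguments of get_all_metrics may differ across package versions.
import moses

# `generated` is a list of SMILES strings, e.g. decoded from model.generate() as in Quick Start
generated = tokenizer.batch_decode(outputs, skip_special_tokens=True)
metrics = moses.get_all_metrics(generated)  # validity, uniqueness, novelty, Frag, Scaf, SNN, IntDiv, FCD, ...
print(metrics)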
🔧 Technical Details
The model employs the same architecture as MoLFormer-XL, including linear attention and rotary position embeddings, and uses decoder-only Transformer blocks trained with a causal language modeling objective.
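To inspect these hyperparameters on the loaded checkpoint (continuing from the Quick Start snippet; the available config fields depend on the remote model code), you can print the configuration and count parameters:
# Continuing from the Quick Start snippet
print(model.config)  # architecture hyperparameters exposed by the remote model code
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")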
📄 License
The model is released under the Apache 2.0 license.
📖 Citation
@misc{ross2025gpmolformerfoundationmodelmolecular,
title={GP-MoLFormer: A Foundation Model For Molecular Generation},
author={Jerret Ross and Brian Belgodere and Samuel C. Hoffman and Vijil Chenthamarakshan and Jiri Navratil and Youssef Mroueh and Payel Das},
year={2025},
eprint={2405.04912},
archivePrefix={arXiv},
primaryClass={q-bio.BM},
url={https://arxiv.org/abs/2405.04912},
}