🚀 GP-MoLFormer-Uniq
GP-MoLFormer is a class of models pretrained on SMILES string representations of 0.65-1.1B molecules from ZINC and PubChem. This repository focuses on the model pretrained on all the unique molecules from both datasets.
🚀 Quick Start
Use the code below to get started with the model.
Basic Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pretrained model (the custom model code requires trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("ibm-research/GP-MoLFormer-Uniq", trust_remote_code=True)
# GP-MoLFormer reuses the MoLFormer-XL tokenizer
tokenizer = AutoTokenizer.from_pretrained("ibm-research/MoLFormer-XL-both-10pct", trust_remote_code=True)

# Sample 3 SMILES strings unconditionally (no prompt, so this is de novo generation)
outputs = model.generate(do_sample=True, top_k=None, max_length=202, num_return_sequences=3)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
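Prompted Generation (Scaffold Completion)
The model can also be prompted with a partial SMILES string to complete or decorate a scaffold (see Intended use and limitations below). The snippet below is a minimal sketch using the generic transformers generation API: the prompt string is only illustrative, and the end-of-sequence stripping step is an assumption that may or may not be needed depending on how the tokenizer encodes prompts. See the GitHub repository for the authors' exact procedure.
# Sketch: prompted generation from a partial SMILES (illustrative, not the official recipe)
prompt = "c1ccccc1"  # hypothetical starting fragment (a benzene ring)
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
# If the tokenizer appends an end-of-sequence token, drop it so generation continues the prompt
if tokenizer.eos_token_id is not None and input_ids[0, -1].item() == tokenizer.eos_token_id:
    input_ids = input_ids[:, :-1]
outputs = model.generate(input_ids, do_sample=True, top_k=None, max_length=202, num_return_sequences=3)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))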
✨ Features
- Large-scale autoregressive chemical language model for molecule generation, with the same architecture as MoLFormer-XL (linear attention and rotary position embeddings) but decoder-only Transformer blocks trained with a causal language modeling objective.
- Trained on up to 1.1B molecules in SMILES representation.
- Out-of-the-box unconditional, de novo molecule generation.
- Scaffold completion/decoration by prompting with a partial SMILES string (see the prompted-generation sketch above).
- Fine-tunable on a particular dataset to change the output distribution (e.g., more druglike), or tunable for molecular optimization using pair-tuning.
📦 Installation
The model and tokenizer load directly through the Hugging Face transformers library (with trust_remote_code=True), so the only requirements are transformers and torch; see the Quick Start above.
📚 Documentation
Model Details
Model Description
GP-MoLFormer is a large-scale autoregressive chemical language model intended for molecule generation tasks. GP-MoLFormer employs the same architecture as MoLFormer-XL, including linear attention and rotary position embeddings, but uses decoder-only Transformer blocks trained with a causal language modeling objective. It is trained on up to 1.1B molecules in SMILES representation.
GP-MoLFormer was evaluated on de novo generation (at scale), scaffold-constrained decoration, and molecular property optimization tasks.
Intended use and limitations
The pretrained model may be used out-of-the-box for unconditional, de novo molecule generation. It can also be prompted with a partial SMILES string to do scaffold completion/decoration. We also demonstrate it can be fine-tuned on a particular dataset to change the output distribution (e.g., more druglike) or tuned for molecular optimization using pair-tuning. For details, see the paper and GitHub repository.
This model is not tested for classification performance. It is also not tested for molecules larger than ~200 atoms (i.e., macromolecules). Furthermore, using invalid or noncanonical SMILES may result in worse performance.
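As context for the fine-tuning use case, the sketch below shows a generic Hugging Face causal-language-modeling fine-tuning loop on a small, hypothetical SMILES dataset. This is our assumption of a standard setup, not the authors' recipe (pair-tuning and the official fine-tuning code live in the GitHub repository); it also assumes the tokenizer defines a pad token and that the remote model code returns a language-modeling loss when labels are provided.
# Sketch: generic causal-LM fine-tuning on custom SMILES (not the authors' pair-tuning procedure)
from datasets import Dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Toy dataset; replace with your own SMILES (ideally canonical, non-isomeric, <= 202 tokens)
train_smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]
dataset = Dataset.from_dict({"smiles": train_smiles})

def tokenize(batch):
    return tokenizer(batch["smiles"], truncation=True, max_length=202)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["smiles"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective (labels = shifted inputs)
args = TrainingArguments(output_dir="gp-molformer-finetuned", per_device_train_batch_size=8, num_train_epochs=1)
trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()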
Training Details
Data
We trained GP-MoLFormer on a combination of molecules from the ZINC15 and PubChem datasets. This repository contains the version trained on all unique molecules from both datasets.
Molecules were canonicalized with RDKit prior to training and isomeric information was removed. Also, molecules longer than 202 tokens were dropped.
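For illustration, the snippet below shows one way to reproduce this kind of preprocessing with RDKit (canonical SMILES with isomeric information removed). It is a sketch of our own, not the authors' exact pipeline.
# Sketch: RDKit canonicalization with stereochemistry/isomeric information removed
from rdkit import Chem

def canonicalize(smiles):
    """Return the canonical SMILES with stereochemistry removed, or None if the input is invalid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # invalid SMILES
    return Chem.MolToSmiles(mol, isomericSmiles=False, canonical=True)

print(canonicalize("C[C@H](N)C(=O)O"))  # the stereocenter annotation is dropped in the output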
Hardware
- 16 x NVIDIA A100 80GB GPUs
Evaluation
We evaluated GP-MoLFormer on various generation metrics. The first table below summarizes the model; the second shows the performance of GP-MoLFormer-Uniq compared to baseline models:
| Property | Details |
|----------|---------|
| Model Type | Large-scale autoregressive chemical language model |
| Training Data | Combination of molecules from ZINC15 and PubChem datasets (all unique molecules) |

| Model | Val↑ | Uniq@10k↑ | Nov↑ | Frag↑ | Scaf↑ | SNN↑ | IntDiv↑ | FCD↓ |
|-------|------|-----------|------|-------|-------|------|---------|------|
| CharRNN | 0.975 | 0.999 | 0.842 | 0.9998 | 0.9242 | 0.6015 | 0.8562 | 0.0732 |
| VAE | 0.977 | 0.998 | 0.695 | 0.9984 | 0.9386 | 0.6257 | 0.8558 | 0.0990 |
| JT-VAE | 1.000 | 1.000 | 0.914 | 0.9965 | 0.8964 | 0.5477 | 0.8551 | 0.3954 |
| LIMO | 1.000 | 0.976 | 1.000 | 0.6989 | 0.0079 | 0.2464 | 0.9039 | 26.78 |
| MolGen-7B | 1.000 | 1.000 | 0.934 | 0.9999 | 0.6538 | 0.5138 | 0.8617 | 0.0435 |
| GP-MoLFormer-Uniq | 1.000 | 0.977 | 0.390 | 0.9998 | 0.7383 | 0.5045 | 0.8655 | 0.0591 |
We report all metrics using the typical MOSES definitions on each model's respective test set. Note: novelty is with respect to each model's respective training set.
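For reference, the sketch below shows how such metrics can be computed with the MOSES package (molsets) on a list of generated SMILES. This is our illustration, not the evaluation code behind the numbers above, and the exact arguments of get_all_metrics may differ across package versions.
import moses

# `generated` is a list of SMILES strings, e.g. decoded from model.generate() as in Quick Start
generated = tokenizer.batch_decode(outputs, skip_special_tokens=True)
metrics = moses.get_all_metrics(generated)  # validity, uniqueness, novelty, Frag, Scaf, SNN, IntDiv, FCD, ...
print(metrics)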
🔧 Technical Details
The model employs the same architecture as MoLFormer-XL, including linear attention and rotary position embeddings, and uses decoder-only Transformer blocks trained with a causal language modeling objective.
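To inspect these hyperparameters on the loaded checkpoint (continuing from the Quick Start snippet; the available config fields depend on the remote model code), you can print the configuration and count parameters:
# Continuing from the Quick Start snippet
print(model.config)  # architecture hyperparameters exposed by the remote model code
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")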
📄 License
The model is released under the Apache 2.0 license.
📖 Citation
@misc{ross2025gpmolformerfoundationmodelmolecular,
title={GP-MoLFormer: A Foundation Model For Molecular Generation},
author={Jerret Ross and Brian Belgodere and Samuel C. Hoffman and Vijil Chenthamarakshan and Jiri Navratil and Youssef Mroueh and Payel Das},
year={2025},
eprint={2405.04912},
archivePrefix={arXiv},
primaryClass={q-bio.BM},
url={https://arxiv.org/abs/2405.04912},
}