MoLFormer-XL-both-10pct: an open-source chemical language model trained on molecular data for chemistry research applications

Molformer XL Both 10pct

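Developed by ibm-research

MoLFormer is a chemical language model pre-trained on SMILES strings of 1.1 billion molecules from ZINC and PubChem. This version was trained on a 10% sample of each dataset.

Molecular model

Transformers

License: Apache-2.0 #molecular-representation #SMILES-processing #drug-discovery

Downloads: 171.96k

Release date: 10/20/2023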

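Model Overview

A chemical language model built on a linear attention Transformer architecture, used mainly for molecular feature extraction and property prediction.

Model Features

Efficient attention mechanism

Uses a linear attention Transformer architecture, whose attention cost grows linearly rather than quadratically with sequence length.

Pre-training on two datasets

Trained on both ZINC15 and PubChem, covering a broader region of chemical space.

Molecular representation learning

Captures the relationship between molecular structure and properties through self-supervised learning.

Model Capabilities

Molecular feature extraction

Molecular property prediction

Molecular similarity computation

Use Cases

Drug discovery

Solubility prediction

Predicts the aqueous solubility of compounds.

RMSE of 0.3295 on the ESOL dataset

Toxicity prediction

Assesses compound toxicity.

AUROC of 84.5 on the Tox21 dataset

Materials science

Quantum chemical property prediction

Predicts quantum mechanical properties of molecules.

Average MAE of 1.7754 on the QM9 dataset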

🚀 MoLFormer-XL-both-10%

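MoLFormer is a class of models pre-trained on SMILES string representations of up to 1.1 billion molecules from ZINC and PubChem. The model in this repository was pre-trained on 10% of each of the two datasets. It was introduced by Ross et al. in the paper Large-Scale Chemical Language Representations Capture Molecular Structure and Properties and first released in this repository.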

🚀 Quick Start

Use the code below to get started with the model:

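import torch
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True lets transformers run the custom model code shipped in the model repository.
model = AutoModel.from_pretrained("ibm/MoLFormer-XL-both-10pct", deterministic_eval=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ibm/MoLFormer-XL-both-10pct", trust_remote_code=True)

# Two example molecules as SMILES strings: caffeine and aspirin.
smiles = ["Cn1c(=O)c2c(ncn2C)n(C)c1=O", "CC(=O)Oc1ccccc1C(=O)O"]
# Pad to a common length so both molecules fit in one batch.
inputs = tokenizer(smiles, padding=True, return_tensors="pt")
with torch.no_grad():  # inference only, so gradients are not needed
    outputs = model(**inputs)
# One pooled embedding vector per input molecule.
outputs.pooler_output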

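✨ Main Features

Can be used for masked language modeling, but is mainly intended as a feature extractor or as a base for fine-tuning on prediction tasks.
The "frozen" model embeddings can be used for similarity measurements, visualization, or training predictor models.
Can also be fine-tuned for sequence classification tasks (e.g., solubility, toxicity).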
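
The similarity use case listed above can be sketched as follows. This is a minimal illustration, not part of the model's API: `cosine_similarity` and `similarity_matrix` are hypothetical helper names, and the short toy vectors stand in for rows of `outputs.pooler_output` (which could be converted with `outputs.pooler_output.tolist()`).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings):
    """Pairwise cosine similarities for a list of embedding vectors."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]


# Toy stand-ins for molecule embeddings (assumption: real ones are much longer).
toy_embeddings = [[1.0, 0.0, 1.0], [0.5, 0.5, 1.0], [-1.0, 0.0, -1.0]]
for row in similarity_matrix(toy_embeddings):
    print([round(value, 3) for value in row])
```

Cosine similarity is only one reasonable choice here. Other distance measures over the same embeddings would follow the same pattern.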

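📚 Documentation

Model Description

MoLFormer is a large-scale chemical language model trained on small molecules represented as SMILES strings. It is trained with masked language modeling and uses a linear attention Transformer combined with rotary embeddings.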

[Figure: MoLFormer pipeline]

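The figure above gives an overview of the MoLFormer pipeline. A Transformer-based neural network is trained in a self-supervised fashion on a large collection of chemical molecules, represented as SMILES sequences, from two public chemical datasets, PubChem and ZINC. The MoLFormer architecture was designed with an efficient linear attention mechanism and relative positional embeddings, with the goal of learning a meaningful, compressed representation of chemical molecules. After training, the MoLFormer foundation model was adapted to different downstream molecular property prediction tasks by fine-tuning on task-specific data. To further test its representational power, the MoLFormer encodings were used to recover molecular similarity, and the correspondence between interatomic spatial distances and attention values was analyzed for a given molecule.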
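
To make "linear attention" and "relative positional embeddings" concrete, the block below writes out the generic textbook formulation. It is a sketch under assumptions: $\phi$ denotes an unspecified non-negative feature map, $R_m$ a position-dependent rotation matrix, and $d$ the head dimension. The specific feature map MoLFormer uses, and exactly how rotary embeddings enter its attention, are not described here.

```latex
% Softmax attention for query position i over N tokens.
% All N x N query-key scores are needed, so the cost is O(N^2 d).
\[
\mathrm{Att}(Q,K,V)_i
  = \frac{\sum_{j=1}^{N} \exp\!\left(q_i^\top k_j / \sqrt{d}\right) v_j^\top}
         {\sum_{j=1}^{N} \exp\!\left(q_i^\top k_j / \sqrt{d}\right)}
\]

% Linear attention: replace the exponential kernel by \phi(q_i)^\top \phi(k_j).
% The two sums over j no longer depend on i, so they are computed once
% and reused for every query, giving a cost that is linear in N.
\[
\mathrm{LinAtt}(Q,K,V)_i
  = \frac{\phi(q_i)^\top \sum_{j=1}^{N} \phi(k_j)\, v_j^\top}
         {\phi(q_i)^\top \sum_{j=1}^{N} \phi(k_j)}
\]

% Rotary embeddings: rotating the query at position m and the key at position n
% makes their inner product depend only on the offset n - m.
\[
(R_m q_m)^\top (R_n k_n) = q_m^\top R_{n-m}\, k_n
\]
```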

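Intended Use and Limitations

The model can be used for masked language modeling, but it is mainly intended for feature extraction or for fine-tuning on a prediction task. The "frozen" model embeddings can be used for similarity measurements, visualization, or training predictor models. The model can also be fine-tuned for sequence classification tasks (e.g., solubility, toxicity).

The model is not intended for molecule generation, and it has not been tested on molecules larger than about 200 atoms (i.e., macromolecules). In addition, invalid or non-canonical SMILES may lead to worse performance.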

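🔧 Technical Details

Training Data

We trained MoLFormer-XL on a combination of molecules from the ZINC15 and PubChem datasets. This repository contains the version trained on 10% ZINC + 10% PubChem.

Before training, molecules were canonicalized with RDKit and isomeric information was removed. Molecules longer than 202 tokens were dropped.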
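
The preprocessing described above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' pipeline. `rdkit_canonicalize`, `count_tokens`, and `preprocess` are hypothetical helper names. The regular expression is a commonly used atom-level SMILES pattern and may not match the model's own tokenizer exactly. Whether the 202-token limit counts special tokens is not stated here. RDKit is a third-party dependency and is imported only when `rdkit_canonicalize` is called.

```python
import re

# Assumption: a widely used regex for splitting SMILES into atom-level tokens.
# The model's own tokenizer may split or count tokens differently.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def rdkit_canonicalize(smiles):
    """Return canonical, non-isomeric SMILES, or None if RDKit cannot parse it."""
    from rdkit import Chem  # lazy import: the rest of the sketch runs without RDKit

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, canonical=True, isomericSmiles=False)


def count_tokens(smiles):
    """Number of atom-level tokens under the assumed regex."""
    return len(SMILES_TOKEN_PATTERN.findall(smiles))


def preprocess(smiles_list, canonicalize=rdkit_canonicalize, max_tokens=202):
    """Canonicalize each SMILES, then drop unparsable or overly long molecules."""
    kept = []
    for smiles in smiles_list:
        canonical = canonicalize(smiles)
        if canonical is None:
            continue  # invalid SMILES
        if count_tokens(canonical) > max_tokens:
            continue  # longer than the stated token limit
        kept.append(canonical)
    return kept
```

With RDKit installed, calling `preprocess` on a list of raw SMILES strings would return their canonical forms without stereochemistry markers. Applying the same canonicalization to inputs at inference time keeps them closer to what the model saw during training.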

Hardware

16 x NVIDIA V100 GPU

Evaluation

We evaluated MoLFormer by fine-tuning it on 11 benchmark tasks from MoleculeNet. The tables below show the performance of different MoLFormer variants:

Variant	BBBP	HIV	BACE	SIDER	ClinTox	Tox21
10% ZINC + 10% PubChem	91.5	81.3	86.6	68.9	94.6	84.5
10% ZINC + 100% PubChem	92.2	79.2	86.3	69.0	94.7	84.5
100% ZINC	89.9	78.4	87.7	66.8	82.2	83.2
MoLFormer-Base	90.9	77.7	82.8	64.8	61.3	43.1
MoLFormer-XL	93.7	82.2	88.2	69.0	94.8	84.7

Variant	QM9	QM8	ESOL	FreeSolv	Lipophilicity
10% ZINC + 10% PubChem	1.7754	0.0108	0.3295	0.2221	0.5472
10% ZINC + 100% PubChem	1.9093	0.0102	0.2775	0.2050	0.5331
100% ZINC	1.9403	0.0124	0.3023	0.2981	0.5440
MoLFormer-Base	2.2500	0.0111	0.2798	0.2596	0.6492
MoLFormer-XL	1.5984	0.0102	0.2787	0.2308	0.5298

We report AUROC for all classification tasks, average MAE for QM9 and QM8, and RMSE for the remaining regression tasks.
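
For readers unfamiliar with these metrics, the sketch below shows how each can be computed from predictions. It is a plain-Python illustration with hypothetical function names, not the evaluation code behind the tables. Note that `auroc` returns a value between 0 and 1, whereas the classification table appears to report AUROC on a 0 to 100 scale.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def auroc(labels, scores):
    """AUROC as the probability that a random positive is scored above a
    random negative, with ties counted as one half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(rmse([0.0, 0.0], [3.0, 4.0]))
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise AUROC loop is quadratic in the number of samples. It is written this way for clarity and would be replaced by a rank-based computation on large test sets.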

📄 License

This project is licensed under the Apache-2.0 license.

📖 Citation

@article{10.1038/s42256-022-00580-7,
  year = {2022},
  title = {{Large-scale chemical language representations capture molecular structure and properties}},
  author = {Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel},
  journal = {Nature Machine Intelligence},
  doi = {10.1038/s42256-022-00580-7},
  pages = {1256--1264},
  number = {12},
  volume = {4}
}

@misc{https://doi.org/10.48550/arxiv.2106.09553,
  doi = {10.48550/ARXIV.2106.09553},
  url = {https://arxiv.org/abs/2106.09553},
  author = {Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel},
  keywords = {Machine Learning (cs.LG), Computation and Language (cs.CL), Biomolecules (q-bio.BM), FOS: Computer and information sciences, FOS: Computer and information sciences, FOS: Biological sciences, FOS: Biological sciences},
  title = {Large-Scale Chemical Language Representations Capture Molecular Structure and Properties},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}