🚀 ProtT5-XL-UniRef50 Model
ProtT5-XL-UniRef50 is a model pretrained on protein sequences with a masked language modeling (MLM) objective. It captures important biophysical properties of protein sequences and can be used for protein feature extraction or for fine-tuning on downstream tasks.
🚀 Quick Start
ProtT5-XL-UniRef50 is based on the T5-3B model and was pretrained on a large corpus of protein sequences in a self-supervised fashion. A PyTorch example of extracting features for given protein sequences with this model is shown under 💻 Usage Examples below.
✨ Key Features
- Self-supervised pretraining: trained on a large number of protein sequences without any human labeling, which makes it possible to exploit huge amounts of publicly available data.
- Special denoising objective: unlike the original T5 model, it uses a Bart-like MLM denoising objective.
- Captures biophysical properties: the extracted features (LM embeddings) capture important biophysical properties governing protein shape.
📦 Installation
The original documentation does not provide installation steps, so this section is skipped.
💻 Usage Examples
Basic Usage
```python
from transformers import T5Tokenizer, T5EncoderModel
import torch
import re

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
# load the tokenizer and the encoder part of the model
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_uniref50', do_lower_case=False)
model = T5EncoderModel.from_pretrained('Rostlab/prot_t5_xl_uniref50').to(device)
model = model.eval()

sequence_examples = ["PRTEINO", "SEQWENCE"]
# this will replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]
# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)
# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)
# extract embeddings for the first ([0,:]) sequence in the batch while removing padded & special tokens ([0,:7])
emb_0 = embedding_repr.last_hidden_state[0, :7]  # shape (7 x 1024)
print(f"Shape of per-residue embedding of first sequence: {emb_0.shape}")
# do the same for the second ([1,:]) sequence in the batch while taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1, :8]  # shape (8 x 1024)
# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0)  # shape (1024)
print(f"Shape of per-protein embedding of first sequence: {emb_0_per_protein.shape}")
```
Advanced Usage
The original documentation does not provide an advanced usage example.
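As a complement, here is a minimal sketch, not taken from the original card, of how per-protein embeddings can be derived for a whole batch at once: it mean-pools the per-residue embeddings using the attention mask so that padding and the trailing special token are excluded. The helper name `per_protein_embeddings` is ours.

```python
import torch

def per_protein_embeddings(last_hidden_state, attention_mask):
    """Mean-pool per-residue embeddings into one embedding per protein.

    last_hidden_state: (batch, seq_len, 1024) encoder output
    attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.clone()
    lengths = mask.sum(dim=1)                              # real tokens incl. the final special token
    mask[torch.arange(mask.size(0)), lengths - 1] = 0      # drop the special token from the pool
    mask = mask.unsqueeze(-1).type_as(last_hidden_state)   # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)         # (batch, 1024)
    counts = mask.sum(dim=1).clamp(min=1)                  # (batch, 1)
    return summed / counts                                 # (batch, 1024)

# with the variables from the basic example above:
# per_protein = per_protein_embeddings(embedding_repr.last_hidden_state, attention_mask)
# per_protein.shape -> torch.Size([2, 1024])
```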
📚 Documentation
Model Description
ProtT5-XL-UniRef50 is based on the T5-3B model and was pretrained on a large corpus of protein sequences in a self-supervised fashion. Unlike the original T5-3B model, it was trained with a Bart-like MLM denoising objective that randomly masks 15% of the amino acids in the input. The features (LM embeddings) extracted from this self-supervised model have been shown to capture important biophysical properties governing protein shape, which implies that the model has learned some of the grammar of the language of life realized in protein sequences.
Intended Uses and Limitations
The model can be used for protein feature extraction or fine-tuned on downstream tasks. For some tasks, fine-tuning the model yields higher accuracy than using it directly as a feature extractor. In addition, for feature extraction, the features obtained from the encoder work better than those from the decoder.
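For the fine-tuning route, one common pattern, sketched below under our own assumptions (the class name, pooling choice, and head are illustrative and not from the original card), is to attach a small task-specific head on top of mean-pooled encoder embeddings and train the whole stack on labeled data:

```python
import torch.nn as nn
from transformers import T5EncoderModel

class ProtT5Classifier(nn.Module):
    """Illustrative per-protein classifier on top of the ProtT5 encoder."""
    def __init__(self, num_labels, model_name="Rostlab/prot_t5_xl_uniref50"):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.d_model, num_labels)  # d_model = 1024

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # mask-aware mean pooling over residues -> one vector per protein
        mask = attention_mask.unsqueeze(-1).type_as(hidden)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return self.head(pooled)
```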
Training Data
The model was pretrained on UniRef50, a dataset consisting of 45 million protein sequences.
Training Procedure
Preprocessing
The protein sequences are uppercased and tokenized with a single space, using a vocabulary of 21 tokens. The rare amino acids "U, Z, O, B" are mapped to "X". The inputs of the model are then of the form:
Protein Sequence [EOS]
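To make that input format concrete, here is a small sanity-check snippet (ours, not from the original card; the exact token strings depend on the SentencePiece vocabulary), showing the residues followed by the end-of-sequence token:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
enc = tokenizer("M K T X", add_special_tokens=True)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# expected to end with the EOS token, e.g. ['▁M', '▁K', '▁T', '▁X', '</s>']
```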
The preprocessing was performed on the fly, cutting and padding the protein sequences up to 512 tokens. The details of the masking procedure for each sequence are as follows (a schematic sketch follows this list):
- 15% of the amino acids are masked.
- In 90% of the cases, the masked amino acids are replaced by the [MASK] token.
- In 10% of the cases, the masked amino acids are replaced by a random amino acid different from the one they replace.
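The following schematic sketch (our reconstruction of the scheme described above, not the actual training code) illustrates the corruption applied to a sequence:

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWYX")
MASK_TOKEN = "[MASK]"

def mask_sequence(residues, mask_prob=0.15, mask_token_prob=0.9, rng=random):
    """residues: list of single-letter amino acids; returns the corrupted list."""
    corrupted = list(residues)
    for i, aa in enumerate(residues):
        if rng.random() >= mask_prob:
            continue                               # ~85% of residues stay untouched
        if rng.random() < mask_token_prob:
            corrupted[i] = MASK_TOKEN              # 90% of masked positions -> [MASK]
        else:
            # 10% of masked positions -> a different, random amino acid
            corrupted[i] = rng.choice([a for a in AMINO_ACIDS if a != aa])
    return corrupted

# mask_sequence(list("MKTAYIAKQR")) -> e.g. ['M', '[MASK]', 'T', 'A', 'Y', 'I', 'A', 'K', 'Q', 'L']
```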
Pretraining
The model was trained on a single TPU Pod V2-256 for a total of 991,500 steps, using a sequence length of 512 and a batch size of 2k. It was trained starting from the ProtT5-XL-BFD checkpoint rather than from scratch. It has a total of approximately 3 billion parameters and uses an encoder-decoder architecture. Pretraining used the AdaFactor optimizer with an inverse square root learning rate schedule.
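For orientation, here is a minimal sketch of an inverse square root learning-rate schedule of the kind named above, built with PyTorch's LambdaLR; the warmup length, the learning rate, and the use of Adam as a stand-in for AdaFactor are our assumptions, not values from the paper:

```python
import torch

def inverse_sqrt_lambda(warmup_steps=10_000):
    # constant during warmup, then decays proportionally to 1/sqrt(step)
    def lr_lambda(step):
        step = max(step, 1)
        return 1.0 if step < warmup_steps else (warmup_steps / step) ** 0.5
    return lr_lambda

params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for model.parameters()
optimizer = torch.optim.Adam(params, lr=1e-2)   # the actual pretraining used AdaFactor
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, inverse_sqrt_lambda())
for step in range(5):
    optimizer.step()
    scheduler.step()                            # LR stays flat during warmup, then decays ~1/sqrt(step)
```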
Evaluation Results
When the model is used for feature extraction, it achieves the following results:

| Task/Dataset | Secondary structure (3-state) | Secondary structure (8-state) | Subcellular localization | Membrane prediction |
|---|---|---|---|---|
| CASP12 | 81 | 70 | | |
| TS115 | 87 | 77 | | |
| CB513 | 86 | 74 | | |
| DeepLoc | | | 81 | 91 |
🔧 Technical Details
The model is based on the T5-3B architecture, with an encoder-decoder structure and approximately 3 billion parameters in total. Pretraining used the AdaFactor optimizer with an inverse square root learning rate schedule. The Bart-like MLM denoising objective and the masking strategy help the model learn the characteristics of protein sequences.
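As a quick way to check these architecture details, the model configuration can be inspected directly; this snippet is our addition, and the expected values reflect the standard T5-3B sizing the card refers to:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Rostlab/prot_t5_xl_uniref50")
print(config.d_model, config.num_layers, config.num_decoder_layers, config.num_heads, config.d_ff)
# for a T5-3B-sized model: d_model=1024, 24 encoder and 24 decoder layers, 32 heads, d_ff=16384
```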
📄 License
The original documentation does not provide license information, so this section is skipped.
BibTeX entry and citation info
@article {Elnaggar2020.07.12.199554,
author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and BHOWMIK, DEBSINDHU and Rost, Burkhard},
title = {ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
elocation-id = {2020.07.12.199554},
year = {2020},
doi = {10.1101/2020.07.12.199554},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112 times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8 states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. Availability: ProtTrans is available at https://github.com/agemagician/ProtTrans. Competing Interest Statement: The authors have declared no competing interest.},
URL = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
eprint = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554.full.pdf},
journal = {bioRxiv}
}
Created by Ahmed Elnaggar/@Elnaggar_AI | LinkedIn











