🚀 ProtT5-XL-UniRef50 Model
ProtT5-XL-UniRef50 is a model pretrained on protein sequences with a masked language modeling (MLM) objective. The features it extracts capture important biophysical properties of protein sequences, and the model can be used for protein feature extraction or fine-tuned on downstream tasks.
🚀 Quick Start
ProtT5-XL-UniRef50 is based on the T5-3B model and was pretrained in a self-supervised fashion on a large corpus of protein sequences. The following PyTorch example shows how to extract the features of given protein sequences with this model (loading the tokenizer and encoder from the `Rostlab/prot_t5_xl_uniref50` checkpoint):
sequence_examples = ["PRTEINO", "SEQWENCE"]
# this will replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]
# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)
# generate embeddings
with torch.no_grad():
embedding_repr = model(input_ids=input_ids,attention_mask=attention_mask)
# extract embeddings for the first ([0,:]) sequence in the batch while removing padded & special tokens ([0,:7])
emb_0 = embedding_repr.last_hidden_state[0,:7] # shape (7 x 1024)
print(f"Shape of per-residue embedding of first sequences: {emb_0.shape}")
# do the same for the second ([1,:]) sequence in the batch while taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1024)
# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0) # shape (1024)
print(f"Shape of per-protein embedding of first sequences: {emb_0_per_protein.shape}")
✨ Key Features
- Self-supervised pretraining: trained on large amounts of protein sequences without any human labelling, so vast amounts of publicly available data can be exploited.
- Special denoising objective: unlike the original T5 model, it uses a Bart-like MLM denoising objective.
- Captures biophysical properties: the extracted features (LM embeddings) capture important biophysical properties governing protein shape.
📦 Installation
The original documentation does not provide installation steps, so this section is omitted.
💻 Usage Examples
Basic Usage
sequence_examples = ["PRTEINO", "SEQWENCE"]
# this will replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]
# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)
# generate embeddings
with torch.no_grad():
embedding_repr = model(input_ids=input_ids,attention_mask=attention_mask)
# extract embeddings for the first ([0,:]) sequence in the batch while removing padded & special tokens ([0,:7])
emb_0 = embedding_repr.last_hidden_state[0,:7] # shape (7 x 1024)
print(f"Shape of per-residue embedding of first sequences: {emb_0.shape}")
# do the same for the second ([1,:]) sequence in the batch while taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1024)
# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0) # shape (1024)
print(f"Shape of per-protein embedding of first sequences: {emb_0_per_protein.shape}")
Advanced Usage
The original documentation does not provide advanced usage examples, so this part is omitted.
📚 Documentation
Model Description
ProtT5-XL-UniRef50 is based on the T5-3B model and was pretrained in a self-supervised fashion on a large corpus of protein sequences. Unlike the original T5-3B model, it uses a Bart-like MLM denoising objective, randomly masking 15% of the amino acids in the input. Research has shown that the features extracted from this self-supervised model (LM embeddings) capture important biophysical properties governing protein shape, which implies that the model has learned some of the grammar of the language of life as realized in protein sequences.
Intended Uses &amp; Limitations
The model can be used for protein feature extraction or fine-tuned on downstream tasks. For some tasks, fine-tuning the model yields higher accuracy than using it as a pure feature extractor. In addition, for feature extraction, the features produced by the encoder work better than those from the decoder; see the hedged sketch below.
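For feature extraction, the encoder can be loaded on its own via `T5EncoderModel`, as in the Quick Start example. The sketch below shows one possible way to fine-tune a downstream per-protein classifier on top of mean-pooled encoder embeddings; the `ProtT5Classifier` class, the pooling choice, and `num_labels` are illustrative assumptions and not part of the original documentation.

```python
import torch
import torch.nn as nn
from transformers import T5EncoderModel

class ProtT5Classifier(nn.Module):
    """Hypothetical downstream head: mean-pooled ProtT5 encoder embeddings -> linear classifier."""
    def __init__(self, checkpoint="Rostlab/prot_t5_xl_uniref50", num_labels=2):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(checkpoint)
        self.classifier = nn.Linear(self.encoder.config.d_model, num_labels)  # d_model = 1024

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # mean-pool over real (non-padded) tokens to obtain one vector per protein
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return self.classifier(pooled)
```

Whether it is better to fine-tune the encoder end to end or to freeze it and train only the head depends on the task, in line with the note above.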
Training Data
The model was pretrained on the UniRef50 dataset, which contains 45 million protein sequences.
Training Procedure
Preprocessing
The protein sequences are converted to upper case and tokenized with a single space, using a vocabulary size of 21. The rare amino acids "U, Z, O, B" are mapped to "X". The inputs of the model are then of the form:
Protein Sequence [EOS]
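A minimal sketch of the described preprocessing is shown below; the helper name `preprocess` is illustrative, and the [EOS] token (`</s>` for T5) is appended by the tokenizer itself when `add_special_tokens=True` is used, as in the Quick Start example.

```python
import re

def preprocess(sequence: str) -> str:
    """Illustrative helper: upper-case the sequence, map the rare amino acids
    U, Z, O, B to X, and insert single spaces so each amino acid is one token."""
    sequence = sequence.upper()
    sequence = re.sub(r"[UZOB]", "X", sequence)
    return " ".join(sequence)

print(preprocess("prteino"))  # -> "P R T E I N X"
```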
The preprocessing is done on the fly, cropping and padding the protein sequences up to 512 tokens. The details of the masking procedure for each sequence are as follows (a minimal sketch is given after this list):
- 15% of the amino acids are masked.
- In 90% of the cases, the masked amino acids are replaced by the [MASK] token.
- In 10% of the cases, the masked amino acids are replaced by a random amino acid different from the one they replace.
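Below is a minimal sketch of this masking procedure, assuming a per-token Bernoulli selection; the function name, the literal "[MASK]" string, and the fixed seed are illustrative assumptions.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids; rare ones are already mapped to X

def mask_sequence(tokens, mask_token="[MASK]", mask_prob=0.15, seed=None):
    """Illustrative masking: select ~15% of the amino acids; replace 90% of the
    selected ones with the mask token and 10% with a different random amino acid."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    for i, aa in enumerate(tokens):
        if rng.random() < mask_prob:
            if rng.random() < 0.9:
                corrupted[i] = mask_token
            else:
                corrupted[i] = rng.choice([a for a in AMINO_ACIDS if a != aa])
    return corrupted

print(mask_sequence("P R T E I N X".split(), seed=0))
```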
Pretraining
The model was trained on a single TPU Pod V2-256 for a total of 991,500 steps, using a sequence length of 512 and a batch size of 2k. It was initialized from the ProtT5-XL-BFD checkpoint rather than trained from scratch. In total the model has roughly 3 billion parameters and uses an encoder-decoder architecture. Pretraining used the AdaFactor optimizer with an inverse square root learning rate schedule; a hedged configuration sketch follows below.
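The sketch below shows one such optimizer configuration using the `Adafactor` implementation in `transformers`, whose relative-step mode decays the step size roughly as 1/sqrt(step), i.e. an inverse square root schedule. The tiny stand-in module only makes the snippet runnable, and every hyperparameter other than the optimizer and schedule named above is an assumption.

```python
import torch.nn as nn
from transformers import Adafactor

model = nn.Linear(1024, 1024)  # stand-in for the ProtT5 encoder-decoder
optimizer = Adafactor(
    model.parameters(),
    lr=None,                # let Adafactor derive the relative step size
    scale_parameter=True,
    relative_step=True,     # step size decays roughly as 1/sqrt(step)
    warmup_init=True,       # small step sizes during an initial warm-up
)
```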
Evaluation Results
When the model is used for feature extraction, it achieves the following results:
| Task/Dataset | Secondary structure (3-state) | Secondary structure (8-state) | Subcellular localization | Membrane prediction |
|---|---|---|---|---|
| CASP12 | 81 | 70 | | |
| TS115 | 87 | 77 | | |
| CB513 | 86 | 74 | | |
| DeepLoc | | | 81 | 91 |
🔧 Technical Details
The model is based on the T5-3B architecture, using an encoder-decoder structure with roughly 3 billion parameters. Pretraining uses the AdaFactor optimizer with an inverse square root learning rate schedule. The special denoising objective (a Bart-like MLM) and the masking strategy help the model learn the characteristics of protein sequences.
📄 License
The original documentation does not provide license information, so this section is omitted.
BibTeX entry and citation info
```bibtex
@article {Elnaggar2020.07.12.199554,
author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and BHOWMIK, DEBSINDHU and Rost, Burkhard},
title = {ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
elocation-id = {2020.07.12.199554},
year = {2020},
doi = {10.1101/2020.07.12.199554},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112 times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8 states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. Availability ProtTrans: \<a href="https://github.com/agemagician/ProtTrans"\>https://github.com/agemagician/ProtTrans\</a\>Competing Interest StatementThe authors have declared no competing interest.},
URL = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
eprint = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554.full.pdf},
journal = {bioRxiv}
}
```
Created by Ahmed Elnaggar/@Elnaggar_AI | LinkedIn