🚀 ProtT5-XL-UniRef50 Model
ProtT5-XL-UniRef50 is a model pretrained on protein sequences with a masked language modeling (MLM) objective. The features it extracts capture important biophysical properties of protein sequences, and the model can be used for protein feature extraction or fine-tuned on downstream tasks.
🚀 Quick Start
ProtT5-XL-UniRef50 is based on the T5-3B model and was pretrained in a self-supervised fashion on a large corpus of protein sequences. The following PyTorch example shows how to extract the features of given protein sequences with this model (loading the tokenizer and encoder from the `Rostlab/prot_t5_xl_uniref50` checkpoint):
sequence_examples = ["PRTEINO", "SEQWENCE"]
# this will replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]
# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)
# generate embeddings
with torch.no_grad():
embedding_repr = model(input_ids=input_ids,attention_mask=attention_mask)
# extract embeddings for the first ([0,:]) sequence in the batch while removing padded & special tokens ([0,:7])
emb_0 = embedding_repr.last_hidden_state[0,:7] # shape (7 x 1024)
print(f"Shape of per-residue embedding of first sequences: {emb_0.shape}")
# do the same for the second ([1,:]) sequence in the batch while taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1024)
# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0) # shape (1024)
print(f"Shape of per-protein embedding of first sequences: {emb_0_per_protein.shape}")
✨ Key Features
- Self-supervised pretraining: trained on large amounts of protein sequences without any human labelling, so vast amounts of publicly available data can be exploited.
- Special denoising objective: unlike the original T5 model, it uses a Bart-like MLM denoising objective.
- Captures biophysical properties: the extracted features (LM embeddings) capture important biophysical properties governing protein shape.
📦 Installation
The original documentation does not provide installation steps, so this section is omitted.
💻 Usage Examples
Basic Usage
sequence_examples = ["PRTEINO", "SEQWENCE"]
# this will replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]
# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)
# generate embeddings
with torch.no_grad():
embedding_repr = model(input_ids=input_ids,attention_mask=attention_mask)
# extract embeddings for the first ([0,:]) sequence in the batch while removing padded & special tokens ([0,:7])
emb_0 = embedding_repr.last_hidden_state[0,:7] # shape (7 x 1024)
print(f"Shape of per-residue embedding of first sequences: {emb_0.shape}")
# do the same for the second ([1,:]) sequence in the batch while taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1024)
# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0) # shape (1024)
print(f"Shape of per-protein embedding of first sequences: {emb_0_per_protein.shape}")
Advanced Usage
The original documentation does not provide advanced usage examples, so this part is omitted.
📚 Documentation
Model Description
ProtT5-XL-UniRef50 is based on the T5-3B model and was pretrained in a self-supervised fashion on a large corpus of protein sequences. Unlike the original T5-3B model, it uses a Bart-like MLM denoising objective, randomly masking 15% of the amino acids in the input. Research has shown that the features extracted from this self-supervised model (LM embeddings) capture important biophysical properties governing protein shape, which implies that the model has learned some of the grammar of the language of life as realized in protein sequences.
Intended Uses &amp; Limitations
The model can be used for protein feature extraction or fine-tuned on downstream tasks. For some tasks, fine-tuning the model yields higher accuracy than using it as a pure feature extractor. In addition, for feature extraction, the features produced by the encoder work better than those from the decoder; see the hedged sketch below.
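For feature extraction, the encoder can be loaded on its own via `T5EncoderModel`, as in the Quick Start example. The sketch below shows one possible way to fine-tune a downstream per-protein classifier on top of mean-pooled encoder embeddings; the `ProtT5Classifier` class, the pooling choice, and `num_labels` are illustrative assumptions and not part of the original documentation.

```python
import torch
import torch.nn as nn
from transformers import T5EncoderModel

class ProtT5Classifier(nn.Module):
    """Hypothetical downstream head: mean-pooled ProtT5 encoder embeddings -> linear classifier."""
    def __init__(self, checkpoint="Rostlab/prot_t5_xl_uniref50", num_labels=2):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(checkpoint)
        self.classifier = nn.Linear(self.encoder.config.d_model, num_labels)  # d_model = 1024

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # mean-pool over real (non-padded) tokens to obtain one vector per protein
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return self.classifier(pooled)
```

Whether it is better to fine-tune the encoder end to end or to freeze it and train only the head depends on the task, in line with the note above.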
Training Data
The model was pretrained on the UniRef50 dataset, which contains 45 million protein sequences.
Training Procedure
Preprocessing
The protein sequences are converted to upper case and tokenized with a single space, using a vocabulary size of 21. The rare amino acids "U, Z, O, B" are mapped to "X". The inputs of the model are then of the form:
Protein Sequence [EOS]
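A minimal sketch of the described preprocessing is shown below; the helper name `preprocess` is illustrative, and the [EOS] token (`</s>` for T5) is appended by the tokenizer itself when `add_special_tokens=True` is used, as in the Quick Start example.

```python
import re

def preprocess(sequence: str) -> str:
    """Illustrative helper: upper-case the sequence, map the rare amino acids
    U, Z, O, B to X, and insert single spaces so each amino acid is one token."""
    sequence = sequence.upper()
    sequence = re.sub(r"[UZOB]", "X", sequence)
    return " ".join(sequence)

print(preprocess("prteino"))  # -> "P R T E I N X"
```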
The preprocessing is done on the fly, cropping and padding the protein sequences up to 512 tokens. The details of the masking procedure for each sequence are as follows (a minimal sketch is given after this list):
- 15% of the amino acids are masked.
- In 90% of the cases, the masked amino acids are replaced by the [MASK] token.
- In 10% of the cases, the masked amino acids are replaced by a random amino acid different from the one they replace.
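Below is a minimal sketch of this masking procedure, assuming a per-token Bernoulli selection; the function name, the literal "[MASK]" string, and the fixed seed are illustrative assumptions.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids; rare ones are already mapped to X

def mask_sequence(tokens, mask_token="[MASK]", mask_prob=0.15, seed=None):
    """Illustrative masking: select ~15% of the amino acids; replace 90% of the
    selected ones with the mask token and 10% with a different random amino acid."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    for i, aa in enumerate(tokens):
        if rng.random() < mask_prob:
            if rng.random() < 0.9:
                corrupted[i] = mask_token
            else:
                corrupted[i] = rng.choice([a for a in AMINO_ACIDS if a != aa])
    return corrupted

print(mask_sequence("P R T E I N X".split(), seed=0))
```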
Pretraining
The model was trained on a single TPU Pod V2-256 for a total of 991,500 steps, using a sequence length of 512 and a batch size of 2k. It was initialized from the ProtT5-XL-BFD checkpoint rather than trained from scratch. In total the model has roughly 3 billion parameters and uses an encoder-decoder architecture. Pretraining used the AdaFactor optimizer with an inverse square root learning rate schedule; a hedged configuration sketch follows below.
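The sketch below shows one such optimizer configuration using the `Adafactor` implementation in `transformers`, whose relative-step mode decays the step size roughly as 1/sqrt(step), i.e. an inverse square root schedule. The tiny stand-in module only makes the snippet runnable, and every hyperparameter other than the optimizer and schedule named above is an assumption.

```python
import torch.nn as nn
from transformers import Adafactor

model = nn.Linear(1024, 1024)  # stand-in for the ProtT5 encoder-decoder
optimizer = Adafactor(
    model.parameters(),
    lr=None,                # let Adafactor derive the relative step size
    scale_parameter=True,
    relative_step=True,     # step size decays roughly as 1/sqrt(step)
    warmup_init=True,       # small step sizes during an initial warm-up
)
```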
Evaluation Results
When the model is used for feature extraction, it achieves the following results:
| Task/Dataset | Secondary structure (3-state) | Secondary structure (8-state) | Subcellular localization | Membrane prediction |
|---|---|---|---|---|
| CASP12 | 81 | 70 | | |
| TS115 | 87 | 77 | | |
| CB513 | 86 | 74 | | |
| DeepLoc | | | 81 | 91 |
🔧 Technical Details
The model is based on the T5-3B architecture, using an encoder-decoder structure with roughly 3 billion parameters. Pretraining uses the AdaFactor optimizer with an inverse square root learning rate schedule. The special denoising objective (a Bart-like MLM) and the masking strategy help the model learn the characteristics of protein sequences.
📄 License
The original documentation does not provide license information, so this section is omitted.
BibTeX entry and citation info
```bibtex
@article {Elnaggar2020.07.12.199554,
author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and BHOWMIK, DEBSINDHU and Rost, Burkhard},
title = {ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
elocation-id = {2020.07.12.199554},
year = {2020},
doi = {10.1101/2020.07.12.199554},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112 times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8 states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. Availability ProtTrans: \<a href="https://github.com/agemagician/ProtTrans"\>https://github.com/agemagician/ProtTrans\</a\>Competing Interest StatementThe authors have declared no competing interest.},
URL = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
eprint = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554.full.pdf},
journal = {bioRxiv}
}
```
Created by Ahmed Elnaggar/@Elnaggar_AI | LinkedIn