🚀 ProstT5 Model Card
ProstT5 is a protein language model (pLM) that can translate between protein sequences and structures, providing a powerful tool for protein-related research.
🚀 Quick Start
Feature Extraction
from transformers import T5Tokenizer, T5EncoderModel
import torch
import re

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/ProstT5', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained("Rostlab/ProstT5").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model.float() if device.type == 'cpu' else model.half()

# prepare your protein sequences/structures as a list. Amino acid sequences are expected to be upper-case ("PRTEINO" below) while 3Di-sequences need to be lower-case ("strct" below).
sequence_examples = ["PRTEINO", "strct"]

# replace all rare/ambiguous amino acids by X (3Di sequences do not have those) and introduce white-space between all residues (AAs and 3Di)
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# add pre-fixes accordingly (this already expects 3Di-sequences to be lower-case)
# if you go from AAs to 3Di (or if you want to embed AAs), you need to prepend "<AA2fold>"
# if you go from 3Di to AAs (or if you want to embed 3Di), you need to prepend "<fold2AA>"
sequence_examples = ["<AA2fold>" + " " + s if s.isupper() else "<fold2AA>" + " " + s
                     for s in sequence_examples
                     ]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest", return_tensors='pt').to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(
        ids.input_ids,
        attention_mask=ids.attention_mask
    )

# extract residue embeddings for the first ([0,:]) sequence in the batch and remove padded & special tokens, incl. prefix ([0,1:8])
emb_0 = embedding_repr.last_hidden_state[0, 1:8]  # shape (7 x 1024)
# same for the second ([1,:]) sequence but taking into account different sequence lengths ([1,:6])
emb_1 = embedding_repr.last_hidden_state[1, 1:6]  # shape (5 x 1024)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0)  # shape (1024)
Translation ("folding", i.e., from amino acids to 3Di)
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM
import torch
import re

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/ProstT5', do_lower_case=False)

# Load the model
model = AutoModelForSeq2SeqLM.from_pretrained("Rostlab/ProstT5").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model.float() if device.type == 'cpu' else model.half()

# prepare your protein sequences/structures as a list.
# Amino acid sequences are expected to be upper-case ("PRTEINO" below)
# while 3Di-sequences need to be lower-case.
sequence_examples = ["PRTEINO", "SEQWENCE"]
min_len = min([len(s) for s in sequence_examples])
max_len = max([len(s) for s in sequence_examples])

# replace all rare/ambiguous amino acids by X (3Di sequences do not have those) and introduce white-space between all residues (AAs and 3Di)
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# add pre-fixes accordingly. For the translation from AAs to 3Di, you need to prepend "<AA2fold>"
sequence_examples = ["<AA2fold>" + " " + s for s in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples,
                                  add_special_tokens=True,
                                  padding="longest",
                                  return_tensors='pt').to(device)

# Generation configuration for "folding" (AA-->3Di)
gen_kwargs_aa2fold = {
    "do_sample": True,
    "num_beams": 3,
    "top_p": 0.95,
    "temperature": 1.2,
    "top_k": 6,
    "repetition_penalty": 1.2,
}

# translate from AA to 3Di (AA-->3Di)
with torch.no_grad():
    translations = model.generate(
        ids.input_ids,
        attention_mask=ids.attention_mask,
        max_length=max_len,       # max length of generated text
        min_length=min_len,       # minimum length of the generated text
        early_stopping=True,      # stop early if end-of-text token is generated
        num_return_sequences=1,   # return only a single sequence
        **gen_kwargs_aa2fold
    )

# Decode and remove white-spaces between tokens
decoded_translations = tokenizer.batch_decode(translations, skip_special_tokens=True)
structure_sequences = ["".join(ts.split(" ")) for ts in decoded_translations]  # predicted 3Di strings

# Now we can use the same model and invert the translation logic
# to generate an amino acid sequence from the predicted 3Di-sequence (3Di-->AA)

# add pre-fixes accordingly. For the translation from 3Di to AA (3Di-->AA), you need to prepend "<fold2AA>"
sequence_examples_backtranslation = ["<fold2AA>" + " " + s for s in decoded_translations]

# tokenize sequences and pad up to the longest sequence in the batch
ids_backtranslation = tokenizer.batch_encode_plus(sequence_examples_backtranslation,
                                                  add_special_tokens=True,
                                                  padding="longest",
                                                  return_tensors='pt').to(device)

# Example generation configuration for "inverse folding" (3Di-->AA)
gen_kwargs_fold2AA = {
    "do_sample": True,
    "top_p": 0.90,
    "temperature": 1.1,
    "top_k": 6,
    "repetition_penalty": 1.2,
}

# translate from 3Di to AA (3Di-->AA)
with torch.no_grad():
    backtranslations = model.generate(
        ids_backtranslation.input_ids,
        attention_mask=ids_backtranslation.attention_mask,
        max_length=max_len,       # max length of generated text
        min_length=min_len,       # minimum length of the generated text
        early_stopping=True,      # stop early if end-of-text token is generated
        num_return_sequences=1,   # return only a single sequence
        **gen_kwargs_fold2AA
    )

# Decode and remove white-spaces between tokens
decoded_backtranslations = tokenizer.batch_decode(backtranslations, skip_special_tokens=True)
aminoAcid_sequences = ["".join(ts.split(" ")) for ts in decoded_backtranslations]  # predicted amino acid strings
✨ Key Features
- Cross-modal translation: translates between protein sequences and structures, offering a new perspective for protein research.
- Feature extraction: can be used for conventional feature extraction and, unlike the original model, can also embed 3D structures represented by 3Di tokens.
- Folding and inverse folding: supports "folding" from sequence to structure and "inverse folding" from structure back to sequence.
📚 Documentation
Model Details
Model Description
ProstT5 (Protein structure-sequence T5) is based on ProtT5-XL-U50, a T5 model trained on encoding protein sequences with span corruption applied to billions of protein sequences. ProstT5 fine-tunes ProtT5-XL-U50 on translating between protein sequences and structures, using 17 million proteins with high-quality 3D structure predictions from the AlphaFoldDB. Protein structures are converted from 3D to 1D via the 3Di tokens introduced by Foldseek.
In a first step, ProstT5 learned to represent the newly introduced 3Di tokens by continuing the original span-denoising objective on 3Di and amino acid (AA) sequences. Only in a second step was ProstT5 trained on translating between the two modalities. The direction of the translation is indicated by two special tokens ("<fold2AA>" when going from 3Di to AAs, "<AA2fold>" when going from AAs to 3Di).
- Developed by: Michael Heinzinger (GitHub @mheinzinger; Twitter @HeinzingerM)
- Model type: Encoder-decoder (T5)
- Language(s) (NLP): Protein sequences and structures
- License: MIT
- Finetuned from model: ProtT5-XL-U50
Uses
- Feature extraction: The model can be used for conventional feature extraction. For this, we recommend using only the encoder in half-precision (fp16) together with batching. Examples (currently only for the original ProtT5-XL-U50, but replacing the repository link and adding the prefixes is sufficient): script and Colab. Unlike the original ProtT5-XL-U50, which could only embed AA sequences, ProstT5 can now also embed 3D structures represented by 3Di tokens. 3Di tokens can either be derived via Foldseek from 3D structures or predicted by ProstT5 from AA sequences.
- "Folding": translation from sequence (AA) to structure (3Di). The resulting 3Di strings can be used together with Foldseek for remote homology detection while avoiding the explicit computation of 3D structures (one way to export them is sketched after this list).
- "Inverse folding": translation from structure (3Di) to sequence (AA).
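For example, a minimal sketch of exporting the predicted 3Di strings to a FASTA-style file for downstream use with Foldseek-based tooling (this assumes the `structure_sequences` list from the translation example above; the file name and record IDs are placeholders):

# write predicted 3Di strings to a FASTA-style file so that external tools can consume them
# (record IDs and output file name are illustrative placeholders)
predicted_3di = dict(zip(["PRTEINO", "SEQWENCE"], structure_sequences))
with open("predicted_3di.fasta", "w") as handle:
    for seq_id, three_di in predicted_3di.items():
        handle.write(f">{seq_id}\n{three_di.lower()}\n")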
Training Details
Training Data
Training Procedure
For the first phase of pre-training, we continued span-based denoising on 3Di and AA sequences using this script. For the second phase of pre-training (i.e., the actual translation from 3Di to AA sequences and vice versa), we used this script.
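To make the first-phase objective concrete, here is a purely illustrative sketch of standard T5 span corruption applied to a whitespace-separated AA sequence; the sentinel-token layout follows the generic T5 recipe, and the exact masking in the authors' script may differ:

# Illustrative only: standard T5 span denoising on a toy AA sequence.
# A contiguous span is replaced by a sentinel token in the input, and the
# decoder is trained to reconstruct the dropped span after that sentinel.
original_input = "<AA2fold> P R T E I N O"
corrupted_input = "<AA2fold> P R T <extra_id_0> N O"  # span "E I" masked out
denoising_target = "<extra_id_0> E I <extra_id_1>"    # model must recover the masked span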
Training Hyperparameters
- Training regime: We used DeepSpeed (stage 2), gradient accumulation (5 steps), mixed half-precision (bf16) and PyTorch 2.0's torchInductor compiler (see the configuration sketch below).
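As a rough orientation, such a regime could be expressed with Hugging Face `TrainingArguments` roughly as follows. This is a sketch under assumed values, not the authors' actual training script; the batch size, output directory and DeepSpeed config path are placeholders:

from transformers import TrainingArguments

# Sketch of a training configuration matching the regime described above:
# DeepSpeed ZeRO stage 2 (via an external JSON config), 5 gradient-accumulation
# steps, bf16 mixed precision and torch.compile (TorchInductor backend).
training_args = TrainingArguments(
    output_dir="prostt5_finetune",      # placeholder output directory
    per_device_train_batch_size=8,      # placeholder batch size
    gradient_accumulation_steps=5,      # as stated above
    bf16=True,                          # mixed half-precision (bf16)
    deepspeed="ds_config_zero2.json",   # placeholder path to a ZeRO stage-2 config
    torch_compile=True,                 # PyTorch 2.0 TorchInductor compiler
)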
Speed
Generating embeddings for the human proteome from the Pro(s)tT5 encoder took roughly 35 minutes, i.e. about 0.1 seconds per protein, on a single RTX A6000 GPU with 48 GB vRAM using batching and half-precision (fp16). Translation is comparatively slow (0.6-2.5 seconds per protein at average lengths of 135 and 406, respectively) because decoding generates tokens sequentially from left to right. We only used batching and half-precision without further optimization.
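If you want a comparable per-protein timing on your own hardware, a minimal sketch (reusing `model`, `ids`, `device` and `sequence_examples` from the feature-extraction example above; one warm-up pass and one timed pass are arbitrary choices) could look like this:

import time
import torch

# time the encoder forward pass; synchronize so GPU kernels are included in the measurement
with torch.no_grad():
    model(ids.input_ids, attention_mask=ids.attention_mask)  # warm-up pass
    if device.type == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    model(ids.input_ids, attention_mask=ids.attention_mask)
    if device.type == 'cuda':
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{elapsed / len(sequence_examples):.3f} s per protein in this batch")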
🔧 Technical Details
Model Architecture
A T5-based encoder-decoder architecture, fine-tuned to translate between protein sequence and structure.
Training Strategy
Two-stage training: first learning representations for the 3Di tokens, then training on the cross-modal translation.
Performance
The model performs well on feature extraction and cross-modal translation tasks, but translation speed is limited by the sequential nature of decoding.
📄 License
This model is released under the MIT license.











