🚀 ProstT5 Model Card
ProstT5 is a protein language model (pLM) that can translate between protein sequences and structures, providing a powerful tool for protein-related research.
🚀 Quick Start
Feature Extraction
from transformers import T5Tokenizer, T5EncoderModel
import torch
import re

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/ProstT5', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained("Rostlab/ProstT5").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model.float() if device.type == 'cpu' else model.half()

# prepare your protein sequences/structures as a list. Amino acid sequences are expected to be upper-case ("PRTEINO" below) while 3Di-sequences need to be lower-case ("strct" below).
sequence_examples = ["PRTEINO", "strct"]

# replace all rare/ambiguous amino acids by X (3Di sequences do not have those) and introduce white-space between all residues (AAs and 3Di)
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# add pre-fixes accordingly (this already expects 3Di-sequences to be lower-case)
# if you go from AAs to 3Di (or if you want to embed AAs), you need to prepend "<AA2fold>"
# if you go from 3Di to AAs (or if you want to embed 3Di), you need to prepend "<fold2AA>"
sequence_examples = ["<AA2fold>" + " " + s if s.isupper() else "<fold2AA>" + " " + s
                     for s in sequence_examples
                     ]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest", return_tensors='pt').to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(
        ids.input_ids,
        attention_mask=ids.attention_mask
    )

# extract residue embeddings for the first ([0,:]) sequence in the batch and remove padded & special tokens, incl. prefix ([0,1:8])
emb_0 = embedding_repr.last_hidden_state[0, 1:8]  # shape (7 x 1024)
# same for the second ([1,:]) sequence but taking into account different sequence lengths ([1,:6])
emb_1 = embedding_repr.last_hidden_state[1, 1:6]  # shape (5 x 1024)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0)  # shape (1024)
Translation ("folding", i.e., from amino acids to 3Di)
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM
import torch
import re

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/ProstT5', do_lower_case=False)

# Load the model
model = AutoModelForSeq2SeqLM.from_pretrained("Rostlab/ProstT5").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model.float() if device.type == 'cpu' else model.half()

# prepare your protein sequences/structures as a list.
# Amino acid sequences are expected to be upper-case ("PRTEINO" below)
# while 3Di-sequences need to be lower-case.
sequence_examples = ["PRTEINO", "SEQWENCE"]
min_len = min([len(s) for s in sequence_examples])
max_len = max([len(s) for s in sequence_examples])

# replace all rare/ambiguous amino acids by X (3Di sequences do not have those) and introduce white-space between all residues (AAs and 3Di)
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# add pre-fixes accordingly. For the translation from AAs to 3Di, you need to prepend "<AA2fold>"
sequence_examples = ["<AA2fold>" + " " + s for s in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples,
                                  add_special_tokens=True,
                                  padding="longest",
                                  return_tensors='pt').to(device)

# Generation configuration for "folding" (AA-->3Di)
gen_kwargs_aa2fold = {
    "do_sample": True,
    "num_beams": 3,
    "top_p": 0.95,
    "temperature": 1.2,
    "top_k": 6,
    "repetition_penalty": 1.2,
}

# translate from AA to 3Di (AA-->3Di)
with torch.no_grad():
    translations = model.generate(
        ids.input_ids,
        attention_mask=ids.attention_mask,
        max_length=max_len,       # max length of generated text
        min_length=min_len,       # minimum length of the generated text
        early_stopping=True,      # stop early if end-of-text token is generated
        num_return_sequences=1,   # return only a single sequence
        **gen_kwargs_aa2fold
    )

# Decode and remove white-spaces between tokens
decoded_translations = tokenizer.batch_decode(translations, skip_special_tokens=True)
structure_sequences = ["".join(ts.split(" ")) for ts in decoded_translations]  # predicted 3Di strings

# Now we can use the same model and invert the translation logic
# to generate an amino acid sequence from the predicted 3Di-sequence (3Di-->AA)

# add pre-fixes accordingly. For the translation from 3Di to AA (3Di-->AA), you need to prepend "<fold2AA>"
sequence_examples_backtranslation = ["<fold2AA>" + " " + s for s in decoded_translations]

# tokenize sequences and pad up to the longest sequence in the batch
ids_backtranslation = tokenizer.batch_encode_plus(sequence_examples_backtranslation,
                                                  add_special_tokens=True,
                                                  padding="longest",
                                                  return_tensors='pt').to(device)

# Example generation configuration for "inverse folding" (3Di-->AA)
gen_kwargs_fold2AA = {
    "do_sample": True,
    "top_p": 0.90,
    "temperature": 1.1,
    "top_k": 6,
    "repetition_penalty": 1.2,
}

# translate from 3Di to AA (3Di-->AA)
with torch.no_grad():
    backtranslations = model.generate(
        ids_backtranslation.input_ids,
        attention_mask=ids_backtranslation.attention_mask,
        max_length=max_len,       # max length of generated text
        min_length=min_len,       # minimum length of the generated text
        early_stopping=True,      # stop early if end-of-text token is generated
        num_return_sequences=1,   # return only a single sequence
        **gen_kwargs_fold2AA
    )

# Decode and remove white-spaces between tokens
decoded_backtranslations = tokenizer.batch_decode(backtranslations, skip_special_tokens=True)
aminoAcid_sequences = ["".join(ts.split(" ")) for ts in decoded_backtranslations]  # predicted amino acid strings
✨ Key Features
- Cross-modal translation: translates between protein sequences and structures, offering a new perspective for protein research.
- Feature extraction: can be used for conventional feature extraction and, unlike the original model, can also embed 3D structures represented by 3Di tokens.
- Folding and inverse folding: supports "folding" from sequence to structure and "inverse folding" from structure back to sequence.
📚 Documentation
Model Details
Model Description
ProstT5 (Protein structure-sequence T5) is based on ProtT5-XL-U50, a T5 model trained on encoding protein sequences with span corruption applied to billions of protein sequences. ProstT5 fine-tunes ProtT5-XL-U50 on translating between protein sequences and structures, using 17 million proteins with high-quality 3D structure predictions from the AlphaFoldDB. Protein structures are converted from 3D to 1D via the 3Di tokens introduced by Foldseek.
In a first step, ProstT5 learned to represent the newly introduced 3Di tokens by continuing the original span-denoising objective on 3Di and amino acid (AA) sequences. Only in a second step was ProstT5 trained on translating between the two modalities. The direction of the translation is indicated by two special tokens ("<fold2AA>" when going from 3Di to AAs, "<AA2fold>" when going from AAs to 3Di).
- Developed by: Michael Heinzinger (GitHub @mheinzinger; Twitter @HeinzingerM)
- Model type: Encoder-decoder (T5)
- Language(s) (NLP): Protein sequences and structures
- License: MIT
- Finetuned from model: ProtT5-XL-U50
Uses
- Feature extraction: The model can be used for conventional feature extraction. For this, we recommend using only the encoder in half-precision (fp16) together with batching. Examples (currently only for the original ProtT5-XL-U50, but replacing the repository link and adding the prefixes is sufficient): script and Colab. Unlike the original ProtT5-XL-U50, which could only embed AA sequences, ProstT5 can now also embed 3D structures represented by 3Di tokens. 3Di tokens can either be derived via Foldseek from 3D structures or predicted by ProstT5 from AA sequences.
- "Folding": translation from sequence (AA) to structure (3Di). The resulting 3Di strings can be used together with Foldseek for remote homology detection while avoiding the explicit computation of 3D structures (one way to export them is sketched after this list).
- "Inverse folding": translation from structure (3Di) to sequence (AA).
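For example, a minimal sketch of exporting the predicted 3Di strings to a FASTA-style file for downstream use with Foldseek-based tooling (this assumes the `structure_sequences` list from the translation example above; the file name and record IDs are placeholders):

# write predicted 3Di strings to a FASTA-style file so that external tools can consume them
# (record IDs and output file name are illustrative placeholders)
predicted_3di = dict(zip(["PRTEINO", "SEQWENCE"], structure_sequences))
with open("predicted_3di.fasta", "w") as handle:
    for seq_id, three_di in predicted_3di.items():
        handle.write(f">{seq_id}\n{three_di.lower()}\n")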
Training Details
Training Data
Training Procedure
For the first phase of pre-training, we continued span-based denoising on 3Di and AA sequences using this script. For the second phase of pre-training (i.e., the actual translation from 3Di to AA sequences and vice versa), we used this script.
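To make the first-phase objective concrete, here is a purely illustrative sketch of standard T5 span corruption applied to a whitespace-separated AA sequence; the sentinel-token layout follows the generic T5 recipe, and the exact masking in the authors' script may differ:

# Illustrative only: standard T5 span denoising on a toy AA sequence.
# A contiguous span is replaced by a sentinel token in the input, and the
# decoder is trained to reconstruct the dropped span after that sentinel.
original_input = "<AA2fold> P R T E I N O"
corrupted_input = "<AA2fold> P R T <extra_id_0> N O"  # span "E I" masked out
denoising_target = "<extra_id_0> E I <extra_id_1>"    # model must recover the masked span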
Training Hyperparameters
- Training regime: We used DeepSpeed (stage 2), gradient accumulation (5 steps), mixed half-precision (bf16) and PyTorch 2.0's torchInductor compiler (see the configuration sketch below).
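As a rough orientation, such a regime could be expressed with Hugging Face `TrainingArguments` roughly as follows. This is a sketch under assumed values, not the authors' actual training script; the batch size, output directory and DeepSpeed config path are placeholders:

from transformers import TrainingArguments

# Sketch of a training configuration matching the regime described above:
# DeepSpeed ZeRO stage 2 (via an external JSON config), 5 gradient-accumulation
# steps, bf16 mixed precision and torch.compile (TorchInductor backend).
training_args = TrainingArguments(
    output_dir="prostt5_finetune",      # placeholder output directory
    per_device_train_batch_size=8,      # placeholder batch size
    gradient_accumulation_steps=5,      # as stated above
    bf16=True,                          # mixed half-precision (bf16)
    deepspeed="ds_config_zero2.json",   # placeholder path to a ZeRO stage-2 config
    torch_compile=True,                 # PyTorch 2.0 TorchInductor compiler
)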
Speed
Generating embeddings for the human proteome from the Pro(s)tT5 encoder took roughly 35 minutes, i.e. about 0.1 seconds per protein, on a single RTX A6000 GPU with 48 GB vRAM using batching and half-precision (fp16). Translation is comparatively slow (0.6-2.5 seconds per protein at average lengths of 135 and 406, respectively) because decoding generates tokens sequentially from left to right. We only used batching and half-precision without further optimization.
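If you want a comparable per-protein timing on your own hardware, a minimal sketch (reusing `model`, `ids`, `device` and `sequence_examples` from the feature-extraction example above; one warm-up pass and one timed pass are arbitrary choices) could look like this:

import time
import torch

# time the encoder forward pass; synchronize so GPU kernels are included in the measurement
with torch.no_grad():
    model(ids.input_ids, attention_mask=ids.attention_mask)  # warm-up pass
    if device.type == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    model(ids.input_ids, attention_mask=ids.attention_mask)
    if device.type == 'cuda':
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{elapsed / len(sequence_examples):.3f} s per protein in this batch")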
🔧 Technical Details
Model Architecture
A T5-based encoder-decoder architecture, fine-tuned to translate between protein sequence and structure.
Training Strategy
Two-stage training: first learning representations for the 3Di tokens, then training on the cross-modal translation.
Performance
The model performs well on feature extraction and cross-modal translation tasks, but translation speed is limited by the sequential nature of decoding.
📄 License
This model is released under the MIT license.











