ProstT5: an open-source protein language model that translates between protein sequence and structure, free to use

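Developed by Rostlab

ProstT5 is a protein language model that can translate between protein sequence and structure.

Protein model

Transformers

Open-source license: MIT #protein sequence-structure translation #3Di token embeddings #remote homology detection

Downloads: 252.91k

Release date: 7/21/2023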

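Model overview

ProstT5 (Protein structure-sequence T5) is based on ProtT5-XL-U50 and was fine-tuned to translate in both directions between protein sequence and 3D structure. It can predict structure from an amino acid sequence (folding), expressed as 3Di tokens rather than 3D coordinates, and generate an amino acid sequence from a structure (inverse folding).

Model features

Bidirectional translation

Translates in both directions between protein sequence (AA) and structure (3Di), covering folding (AA→3Di) and inverse folding (3Di→AA).

Fine-tuned from ProtT5-XL-U50

Fine-tuned on 17 million proteins with high-quality 3D structure predictions, inheriting the representational strength of ProtT5-XL-U50.

Structural feature extraction

Extracts features from 3D structures represented as 3Di tokens, extending what earlier protein language models could do.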

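Model capabilities

Translation from protein sequence to structure

Translation from protein structure to sequence

Feature extraction from protein sequences

Feature extraction from protein structures

Use cases

Bioinformatics

Remote homology detection

Combining predicted 3Di strings with Foldseek enables remote homology detection without explicitly computing 3D structures.

Protein design

Inverse folding generates candidate amino acid sequences from a 3D structure, supporting protein design.

Computational biology

Protein structure prediction

Predicts a simplified representation of 3D structure (3Di tokens) from an amino acid sequence.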

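🚀 ProstT5 model card

ProstT5 is a protein language model (pLM) that can translate between protein sequence and structure. It learns the relationship between a protein's structure and its sequence and can convert one into the other.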

[Figure: ProstT5 pre-training and inference]

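📚 Detailed documentation

Model description

ProstT5 (Protein structure-sequence T5) is based on ProtT5-XL-U50, a T5 model trained to encode protein sequences by applying span corruption to billions of protein sequences.

ProstT5 fine-tunes ProtT5-XL-U50 on translating between protein sequence and structure, using 17 million proteins with high-quality 3D structure predictions from AlphaFoldDB.

Protein structures are converted from 3D to 1D using the 3Di tokens introduced by Foldseek. In a first step, ProstT5 learns to represent the newly introduced 3Di tokens by continuing the original span-denoising objective, applied to both 3Di and amino acid (AA) sequences. Only in a second step is ProstT5 trained on translating between the two modalities.

The direction of translation is indicated by two special tokens: "<fold2AA>" for translating from 3Di to AA and "<AA2fold>" for translating from AA to 3Di. To avoid clashes with AA tokens, 3Di tokens are cast to lower case (the alphabets are otherwise identical).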
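
The input convention described above (direction prefix, lower-case 3Di, upper-case AA, one token per residue) can be sketched in plain Python. This is a minimal illustration, not part of the ProstT5 code base: the helper name `prepare_input` and its `is_3di` flag are hypothetical, and the replacement of rare amino acids by X follows the preprocessing used in the quick-start snippets of this page.

```python
import re

AA2FOLD = "<AA2fold>"  # prefix for AA -> 3Di (also used when embedding AA sequences)
FOLD2AA = "<fold2AA>"  # prefix for 3Di -> AA (also used when embedding 3Di strings)


def prepare_input(sequence: str, is_3di: bool) -> str:
    """Hypothetical helper: format one sequence the way the model expects it.

    AA sequences are upper-cased and rare/ambiguous residues (U, Z, O, B)
    are mapped to X; 3Di strings are lower-cased. Residues are separated by
    single spaces and the matching direction prefix is prepended.
    """
    if is_3di:
        residues = sequence.lower()
        prefix = FOLD2AA
    else:
        residues = re.sub(r"[UZOB]", "X", sequence.upper())
        prefix = AA2FOLD
    return prefix + " " + " ".join(residues)


print(prepare_input("PRTEINO", is_3di=False))  # <AA2fold> P R T E I N X
print(prepare_input("STRCT", is_3di=True))     # <fold2AA> s t r c t
```

Passing the modality explicitly, rather than inferring it from the letter case, is a design choice of this sketch: it avoids mislabelling an AA sequence that happens to arrive in lower case.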

Developed by: Michael Heinzinger (GitHub @mheinzinger; Twitter @HeinzingerM)
Model type: Encoder-decoder (T5)
Language(s) (NLP): Protein sequence and structure
License: MIT
Fine-tuned from model: ProtT5-XL-U50

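✨ Main features

1. The model can be used for traditional feature extraction. For this, we recommend using only the encoder, in half precision (fp16), together with batching. Examples (currently only for the original ProtT5-XL-U50, but they work after replacing the repository link and adding the prefixes): script and [colab](https://colab.research.google.com/drive/1h7F5v5xkE_ly-1bTQSu-1xaLtTP2TnLF?usp=sharing)
2. While the original ProtT5-XL-U50 could only embed AA sequences, ProstT5 can now also embed 3D structures represented by 3Di tokens. 3Di tokens can either be derived from 3D structures using Foldseek or be predicted from AA sequences by ProstT5.
3. "Folding": translation from sequence (AA) to structure (3Di). The resulting 3Di strings can be used together with Foldseek for remote homology detection without explicitly computing 3D structures.
4. "Inverse Folding": translation from structure (3Di) to sequence (AA).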

🚀 Quick start

Feature extraction

import re

from transformers import T5Tokenizer, T5EncoderModel
import torch
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/ProstT5', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained("Rostlab/ProstT5").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model.float() if device.type == 'cpu' else model.half()

# prepare your protein sequences/structures as a list. Amino acid sequences are expected to be upper-case ("PRTEINO" below) while 3Di-sequences need to be lower-case ("strct" below).
sequence_examples = ["PRTEINO", "strct"]

# replace all rare/ambiguous amino acids by X (3Di sequences do not have those) and introduce white-space between all residues (AAs and 3Di)
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# add pre-fixes accordingly (this already expects 3Di-sequences to be lower-case)
# if you go from AAs to 3Di (or if you want to embed AAs), you need to prepend "<AA2fold>"
# if you go from 3Di to AAs (or if you want to embed 3Di), you need to prepend "<fold2AA>"
sequence_examples = [ "<AA2fold>" + " " + s if s.isupper() else "<fold2AA>" + " " + s
                      for s in sequence_examples
                    ]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest", return_tensors='pt').to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(
              ids.input_ids, 
              attention_mask=ids.attention_mask
              )

# extract residue embeddings for the first ([0,:]) sequence in the batch and remove padded & special tokens, incl. prefix ([0,1:8]) 
emb_0 = embedding_repr.last_hidden_state[0,1:8] # shape (7 x 1024)
# same for the second ([1,:]) sequence but taking into account different sequence lengths ([1,1:6])
emb_1 = embedding_repr.last_hidden_state[1,1:6] # shape (5 x 1024)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0) # shape (1024)
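
The index ranges [0,1:8] and [1,1:6] above follow from the token layout of each row: one prefix token, then one token per residue, then the end-of-sequence token and any padding. A minimal sketch of that arithmetic, using a hypothetical helper `residue_slice` and plain lists as stand-ins for tokenized rows (the token strings shown are illustrative, not tokenizer output):

```python
def residue_slice(seq_len: int) -> slice:
    """Hypothetical helper: slice that keeps only the residue positions.

    Assumes each tokenized row is laid out as
    [prefix, residue_1, ..., residue_L, end-of-sequence, padding...],
    so the residues occupy positions 1 .. L inclusive.
    """
    if seq_len < 1:
        raise ValueError("seq_len must be at least 1")
    return slice(1, seq_len + 1)


# Stand-ins for two tokenized rows padded to the same length.
row_0 = ["<AA2fold>", "P", "R", "T", "E", "I", "N", "X", "</s>"]
row_1 = ["<fold2AA>", "s", "t", "r", "c", "t", "</s>", "<pad>", "<pad>"]

print(row_0[residue_slice(7)])  # ['P', 'R', 'T', 'E', 'I', 'N', 'X']
print(row_1[residue_slice(5)])  # ['s', 't', 'r', 'c', 't']
```

The same slice object can index the token axis of a tensor, for example `embedding_repr.last_hidden_state[i, residue_slice(n)]`, where `n` is the unpadded length of sequence `i`.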

Translation ("folding", i.e. AA to 3Di)

import re

from transformers import T5Tokenizer, AutoModelForSeq2SeqLM
import torch
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/ProstT5', do_lower_case=False)

# Load the model
model = AutoModelForSeq2SeqLM.from_pretrained("Rostlab/ProstT5").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model.float() if device.type == 'cpu' else model.half()

# prepare your protein sequences/structures as a list.
# Amino acid sequences are expected to be upper-case ("PRTEINO" below)
# while 3Di-sequences need to be lower-case.
sequence_examples = ["PRTEINO", "SEQWENCE"]
min_len = min([ len(s) for s in sequence_examples])
max_len = max([ len(s) for s in sequence_examples])

# replace all rare/ambiguous amino acids by X (3Di sequences do not have those) and introduce white-space between all residues (AAs and 3Di)
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# add pre-fixes accordingly. For the translation from AAs to 3Di, you need to prepend "<AA2fold>"
sequence_examples = [ "<AA2fold>" + " " + s for s in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples,
                                  add_special_tokens=True,
                                  padding="longest",
                                  return_tensors='pt').to(device)

# Generation configuration for "folding" (AA-->3Di)
gen_kwargs_aa2fold = {
                  "do_sample": True,
                  "num_beams": 3, 
                  "top_p" : 0.95, 
                  "temperature" : 1.2, 
                  "top_k" : 6,
                  "repetition_penalty" : 1.2,
}

# translate from AA to 3Di (AA-->3Di)
with torch.no_grad():
  translations = model.generate( 
              ids.input_ids, 
              attention_mask=ids.attention_mask, 
              max_length=max_len, # max length of generated text
              min_length=min_len, # minimum length of the generated text
              early_stopping=True, # stop early if end-of-text token is generated
              num_return_sequences=1, # return only a single sequence
              **gen_kwargs_aa2fold
  )
# Decode and remove white-spaces between tokens
decoded_translations = tokenizer.batch_decode( translations, skip_special_tokens=True )
structure_sequences = [ "".join(ts.split(" ")) for ts in decoded_translations ] # predicted 3Di strings

# Now we can use the same model and invert the translation logic
# to generate an amino acid sequence from the predicted 3Di-sequence (3Di-->AA)

# add pre-fixes accordingly. For the translation from 3Di to AA (3Di-->AA), you need to prepend "<fold2AA>"
sequence_examples_backtranslation = [ "<fold2AA>" + " " + s for s in decoded_translations]

# tokenize sequences and pad up to the longest sequence in the batch
ids_backtranslation = tokenizer.batch_encode_plus(sequence_examples_backtranslation,
                                  add_special_tokens=True,
                                  padding="longest",
                                  return_tensors='pt').to(device)

# Example generation configuration for "inverse folding" (3Di-->AA)
gen_kwargs_fold2AA = {
            "do_sample": True,
            "top_p" : 0.90,
            "temperature" : 1.1,
            "top_k" : 6,
            "repetition_penalty" : 1.2,
}

# translate from 3Di to AA (3Di-->AA)
with torch.no_grad():
  backtranslations = model.generate( 
              ids_backtranslation.input_ids, 
              attention_mask=ids_backtranslation.attention_mask, 
              max_length=max_len, # max length of generated text
              min_length=min_len, # minimum length of the generated text
              early_stopping=True, # stop early if end-of-text token is generated
              num_return_sequences=1, # return only a single sequence
              **gen_kwargs_fold2AA
  )
# Decode and remove white-spaces between tokens
decoded_backtranslations = tokenizer.batch_decode( backtranslations, skip_special_tokens=True )
aminoAcid_sequences = [ "".join(ts.split(" ")) for ts in decoded_backtranslations ] # predicted amino acid strings
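
Predicted 3Di strings such as `structure_sequences` above are typically saved before being handed to a downstream tool. The following is a minimal sketch of serializing them as FASTA text, assuming a hypothetical helper `to_fasta` and made-up identifiers and 3Di strings. Any case conversion or database-building step that Foldseek needs is outside this sketch.

```python
def to_fasta(records, line_width=60):
    """Hypothetical helper: render (identifier, sequence) pairs as FASTA text.

    Each record becomes a '>' header line followed by the sequence,
    wrapped at `line_width` characters per line.
    """
    lines = []
    for identifier, sequence in records:
        if not sequence:
            raise ValueError(f"empty sequence for {identifier!r}")
        lines.append(f">{identifier}")
        for start in range(0, len(sequence), line_width):
            lines.append(sequence[start:start + line_width])
    return "\n".join(lines) + "\n"


# Illustrative identifiers and 3Di strings, not real model output.
example_records = [("protein_1", "dvvlvvl"), ("protein_2", "dpdpvvsc")]
print(to_fasta(example_records), end="")
```

Returning a string rather than writing directly to disk keeps the helper easy to test; the caller can write the result to a file with the usual `open(path, "w")` idiom.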