prot_t5_xl_uniref50: an open-source protein sequence model that captures biophysical properties and supports protein research


Prot T5 Xl Uniref50

Developed by Rostlab

A pretrained protein sequence model based on the T5-3B architecture. It captures the biophysical properties of proteins through self-supervised learning.

Protein model

Transformers

#Protein sequence embedding #Self-supervised pretraining #Residue-level feature extraction

Downloads: 78.45k

Release date: 3/2/2022

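Model overview

The model was pretrained on the UniRef50 dataset with a masked language modeling objective. It extracts biologically meaningful feature representations from protein sequences and is suited to tasks such as protein structure prediction and function analysis.

Model features

Large-scale pretraining

Pretrained on the UniRef50 dataset, which contains 45 million protein sequences.

Captures biophysical properties

The features the model learns reflect key biophysical properties that determine a protein's tertiary structure.

Dual-purpose design

Supports direct feature extraction and can also be fine-tuned for specific downstream tasks.

Efficient masking strategy

Randomly masks 15% of the amino acids. Of the masked positions, 90% are replaced with [MASK] and 10% with a random amino acid.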

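Model capabilities

Protein sequence feature extraction

Protein secondary structure prediction

Subcellular localization prediction

Membrane protein detection

Protein function prediction

Use cases

Structural biology

Protein secondary structure prediction

Predicts the 3-state or 8-state secondary structure of a protein.

Reached 81% accuracy (3-state) on the CASP12 dataset.

Cell biology

Subcellular localization prediction

Predicts where in the cell a protein is located.

Reached 81% accuracy on the DeepLoc dataset.

Membrane protein detection

Distinguishes membrane-bound proteins from water-soluble proteins.

Reached 91% accuracy on the DeepLoc dataset.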

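🚀 ProtT5-XL-UniRef50 model

This model was pretrained on protein sequences with a masked language modeling (MLM) objective. It was introduced in the ProtTrans paper (Elnaggar et al., 2020; see the BibTeX citation at the end of this page) and first released in the ProtTrans repository (https://github.com/agemagician/ProtTrans). The model was trained on uppercase amino acids and works only with uppercase amino acid input.

🚀 Quick start

ProtT5-XL-UniRef50 can be used for protein sequence feature extraction or fine-tuned on downstream tasks. The usage example below shows how to extract features for given protein sequences.

✨ Key features

Extracts features from protein sequences.
Can be fine-tuned on downstream tasks; fine-tuning yields higher accuracy on some tasks.
For feature extraction, features taken from the encoder give better results than those taken from the decoder.

💻 Usage examples

Basic usage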

import re

import torch
from transformers import T5EncoderModel, T5Tokenizer

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Hugging Face Hub ID, assumed from this page's title and developer (Rostlab)
model_name = "Rostlab/prot_t5_xl_uniref50"
# load the tokenizer and the encoder only (encoder features are recommended for feature extraction)
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name).to(device)

sequence_examples = ["PRTEINO", "SEQWENCE"]
# replace all rare/ambiguous amino acids with X and insert white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids["input_ids"]).to(device)
attention_mask = torch.tensor(ids["attention_mask"]).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract embeddings for the first ([0,:]) sequence in the batch while removing padded & special tokens ([0,:7])
emb_0 = embedding_repr.last_hidden_state[0, :7]  # shape (7 x 1024)
print(f"Shape of per-residue embedding of first sequence: {emb_0.shape}")
# do the same for the second ([1,:]) sequence in the batch while taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1, :8]  # shape (8 x 1024)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0)  # shape (1024)

print(f"Shape of per-protein embedding of first sequence: {emb_0_per_protein.shape}")
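
The hard-coded slices `[0,:7]` and `[1,:8]` above rely on knowing each sequence length in advance. A minimal sketch of deriving those lengths from the attention mask follows. It assumes that the tokenizer appends exactly one end-of-sequence token per sequence and pads on the right; the helper name `residue_lengths` is hypothetical and not part of any library.

```python
def residue_lengths(attention_mask):
    """Return the number of residue tokens per sequence.

    attention_mask: list of lists of 0/1, one row per sequence
    (for example ids["attention_mask"] from the tokenizer output).
    Assumes one trailing special token (EOS) per sequence and right padding.
    """
    lengths = []
    for row in attention_mask:
        n_tokens = sum(row)                   # real tokens, including EOS
        lengths.append(max(n_tokens - 1, 0))  # drop the EOS token
    return lengths


# Toy mask matching the two example sequences above.
example_mask = [
    [1, 1, 1, 1, 1, 1, 1, 1, 0],  # 7 residues + EOS + 1 padding position
    [1, 1, 1, 1, 1, 1, 1, 1, 1],  # 8 residues + EOS
]
lengths = residue_lengths(example_mask)
print(lengths)

# With real model output, per-residue embeddings could then be sliced as:
# embeddings = [embedding_repr.last_hidden_state[i, :n] for i, n in enumerate(lengths)]
```

Deriving the lengths this way avoids editing slice bounds by hand whenever the batch changes.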

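📚 Documentation

Model description

ProtT5-XL-UniRef50 is based on the t5-3b model and was pretrained on a large corpus of protein sequences in a self-supervised fashion. This means it was pretrained on raw protein sequences only, with no human labeling, using an automatic process that generates inputs and labels from those sequences (which is why it can use large amounts of publicly available data).

One important difference between this T5 model and the original T5 is the denoising objective. The original T5-3B model was pretrained with a span denoising objective, whereas this model was pretrained with a BART-like MLM denoising objective. The masking probability matches the original T5 training: 15% of the amino acids in the input are randomly masked.

The features extracted from this self-supervised model (LM embeddings) have been shown to capture important biophysical properties governing protein shape. This suggests that the model learned some of the grammar of the language of life realized in protein sequences.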

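Intended uses and limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks. On some tasks, fine-tuning the model has been found to give higher accuracy than using it as a feature extractor. For feature extraction, features from the encoder are recommended over those from the decoder.

Training data

ProtT5-XL-UniRef50 was pretrained on UniRef50, a dataset of 45 million protein sequences.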

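Training procedure

Preprocessing

Protein sequences are uppercased and tokenized using a single space, with a vocabulary size of 21. The rare amino acids "U, Z, O, B" are mapped to "X". The model inputs then take the following form:

Protein Sequence [EOS]

Preprocessing is performed on the fly, truncating and padding protein sequences to a maximum of 512 tokens.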
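
The preprocessing steps above can be sketched in plain Python. This is an illustration only, not the original training pipeline: the function name `preprocess` is hypothetical, the literal `"[EOS]"` stands in for whatever end-of-sequence token the real tokenizer emits, and counting the EOS token toward the 512-token limit is an assumption.

```python
import re

RARE_AMINO_ACIDS = re.compile(r"[UZOB]")


def preprocess(sequence, max_tokens=512):
    """Uppercase, map rare residues to X, and split into residue tokens.

    max_tokens includes the trailing EOS token, so at most
    max_tokens - 1 residues are kept (an assumption for this sketch).
    """
    cleaned = RARE_AMINO_ACIDS.sub("X", sequence.upper())
    residues = list(cleaned)[: max_tokens - 1]
    return residues + ["[EOS]"]


tokens = preprocess("prteino")
print(" ".join(tokens))
```

Padding is left out here because it only matters once several sequences are batched together.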

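The masking procedure for each sequence is as follows:

15% of the amino acids are masked.
In 90% of cases, the masked amino acid is replaced by the [MASK] token.
In 10% of cases, the masked amino acid is replaced by a random amino acid different from the one it replaces.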
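
A minimal sketch of this masking rule follows. It is not the original training code. It assumes that each position is selected independently with probability 0.15 (rather than exactly 15% of positions being chosen) and that the alphabet is the 20 standard amino acids plus X; `mask_sequence` and `AMINO_ACIDS` are hypothetical names.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWYX")  # assumed alphabet: 20 standard residues plus X
MASK_TOKEN = "[MASK]"


def mask_sequence(residues, rng, mask_prob=0.15, mask_token_prob=0.9):
    """Return (corrupted, targets) for one sequence.

    targets[i] holds the original residue where position i was corrupted,
    and None everywhere else.
    """
    corrupted = list(residues)
    targets = [None] * len(residues)
    for i, residue in enumerate(residues):
        if rng.random() >= mask_prob:
            continue  # position left untouched
        targets[i] = residue
        if rng.random() < mask_token_prob:
            corrupted[i] = MASK_TOKEN
        else:
            # Random replacement that is never equal to the original residue.
            choices = [aa for aa in AMINO_ACIDS if aa != residue]
            corrupted[i] = rng.choice(choices)
    return corrupted, targets


rng = random.Random(0)  # fixed seed so the sketch is reproducible
corrupted, targets = mask_sequence(list("SEQWENCE" * 50), rng)
n_masked = sum(t is not None for t in targets)
print(f"corrupted {n_masked} of {len(targets)} positions")
```

Returning the targets alongside the corrupted sequence keeps the labels needed for the denoising loss next to the input they belong to.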

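Pretraining

The model was trained on a single TPU Pod V2-256 for a total of 991.5k steps, using a sequence length of 512 (batch size 2k). Rather than being trained from scratch, it was initialized from the ProtT5-XL-BFD checkpoint. It has approximately 3 billion parameters and was trained with an encoder-decoder architecture. The optimizer used for pretraining was AdaFactor with an inverse square root learning rate schedule.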
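
For readers unfamiliar with the term, one common form of an inverse square root schedule holds the learning rate constant during warm-up and then decays it as 1/sqrt(step). The sketch below illustrates that form. The warm-up length of 10,000 steps is an assumed value; this page does not state the warm-up or any other schedule constant used for ProtT5-XL-UniRef50.

```python
import math


def inverse_sqrt_lr(step, warmup_steps=10_000):
    """Learning rate 1 / sqrt(max(step, warmup_steps)).

    warmup_steps is an assumed value for illustration only.
    """
    return 1.0 / math.sqrt(max(step, warmup_steps))


for step in (1, 10_000, 100_000, 991_500):
    print(step, f"{inverse_sqrt_lr(step):.6f}")
```

The rate stays flat until `warmup_steps` is reached and shrinks slowly afterwards, which suits long runs such as the 991.5k steps reported above.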

Evaluation results

When the model is used for feature extraction, it achieves the following results.

Test results (accuracy, %):

Task/Dataset	Secondary structure (3-state)	Secondary structure (8-state)	Localization	Membrane
CASP12	81	70	-	-
TS115	87	77	-	-
CB513	86	74	-	-
DeepLoc	-	-	81	91
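
The page does not spell out how these scores are computed. Assuming the secondary-structure columns are per-residue accuracy (the Q3 and Q8 measures named in the paper abstract), a score of this kind could be calculated as sketched below. The function name and the toy labels are hypothetical and are not taken from any benchmark.

```python
def per_residue_accuracy(predicted, observed):
    """Percentage of positions where the predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Toy 3-state labels (H = helix, E = strand, C = other); not real data.
observed = "CCHHHHEEEC"
predicted = "CCHHHCEEEC"
print(f"{per_residue_accuracy(predicted, observed):.1f}")
```

The same function works for 8-state labels, since it only compares symbols position by position.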

BibTeX citation

@article {Elnaggar2020.07.12.199554,
	author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and Bhowmik, Debsindhu and Rost, Burkhard},
	title = {ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
	elocation-id = {2020.07.12.199554},
	year = {2020},
	doi = {10.1101/2020.07.12.199554},
	publisher = {Cold Spring Harbor Laboratory},
	abstract = {Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112 times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8 states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. Availability ProtTrans: https://github.com/agemagician/ProtTrans. Competing Interest Statement: The authors have declared no competing interest.},
	URL = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
	eprint = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554.full.pdf},
	journal = {bioRxiv}
}