ESMplusplus_small open-source model: compatible with the standard interface, small version supports batching, free to use

Esmplusplus Small

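Developed by Synthyra

ESM++ is a faithful implementation of ESMC that supports batching and is compatible with the standard Huggingface interface, with no dependency on the ESM Python package. The small version corresponds to the 300-million-parameter version of ESMC.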

Protein model

Transformers

#Protein sequence analysis #Efficient batching #Biomedical research

Downloads: 6,460

Release date: 12/4/2024

Model Overview

ESM++ is a protein language model for masked language modeling, sequence classification, and token classification on protein sequences.

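Model Features

Efficient batching

Compared with ESMC, ESM++ substantially raises throughput through efficient batching, and it is faster even at batch size 1.

Huggingface interface compatibility

Fully compatible with the standard Huggingface interface, with no dependency on the ESM Python package.

Multi-precision support

Supports fp32, fp16, and bf16 precision. fp16 output stays close to fp32 output and is the recommended choice.

Fast embedding

Provides an embed_dataset method for quickly embedding an entire dataset of protein sequences.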

Model Capabilities

Protein sequence embedding

Masked language modeling

Sequence classification

Token classification

Attention map generation

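Use Cases

Protein research

Protein function prediction

Predict protein function using the sequence classification capability.

Protein structure prediction

Predict structure from protein sequence embeddings.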

🚀 ESM++

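ESM++ is a faithful implementation of ESMC (license). It provides batching and standard Huggingface compatibility without requiring the ESM Python package. The small version corresponds to the 300-million-parameter version of ESMC.

🚀 Quick Start

Previously, a bug in Huggingface weight tying caused the ESM++ logits to differ from those of ESMC. This bug has now been fixed.

✨ Key Features

ESM++ provides batching and Huggingface compatibility without requiring the ESM Python package.
Supports sequence-level and token-level classification tasks.
Supports fp32, fp16, and bf16 weights.
Can be fine-tuned with 🤗 peft.
Optionally returns attention maps.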

💻 Usage Examples

Basic usage

from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_small', trust_remote_code=True)
tokenizer = model.tokenizer

sequences = ['MPRTEIN', 'MSEQWENCE']
tokenized = tokenizer(sequences, padding=True, return_tensors='pt')

# tokenized['labels'] = tokenized['input_ids'].clone() # correctly mask input_ids and set unmasked instances of labels to -100 for MLM training

output = model(**tokenized) # get all hidden states with output_hidden_states=True
print(output.logits.shape) # language modeling logits, (batch_size, seq_len, vocab_size), (2, 11, 64)
print(output.last_hidden_state.shape) # last hidden state of the model, (batch_size, seq_len, hidden_size), (2, 11, 960)
print(output.loss) # language modeling loss if you passed labels
#print(output.hidden_states) # all hidden states if you passed output_hidden_states=True (in tuple)
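
The commented-out `labels` line above only notes that inputs must be masked and that unmasked label positions must be set to -100. A minimal sketch of that step follows, written on plain Python lists so the logic is visible without tensors. The function name, the 50% masking rate in the demo, and the token ids (0, 2, 32) are assumptions for illustration, not values taken from the ESM++ tokenizer; with a standard Hugging Face tokenizer you would read them from `tokenizer.mask_token_id` and `tokenizer.all_special_ids`.

```python
import random

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_for_mlm(input_ids, mask_token_id, special_ids, mask_prob=0.15, seed=None):
    """Return (masked_ids, labels) for one tokenized sequence.

    Simplified on purpose: every selected position is replaced by the mask
    token (no random-token or keep-original branches).
    """
    rng = random.Random(seed)
    masked_ids, labels = [], []
    for tok in input_ids:
        if tok not in special_ids and rng.random() < mask_prob:
            masked_ids.append(mask_token_id)  # hide the residue from the model
            labels.append(tok)                # the model must predict the original id
        else:
            masked_ids.append(tok)            # leave the token visible
            labels.append(IGNORE_INDEX)       # no loss at this position
    return masked_ids, labels


# Hypothetical ids: 0 = <cls>, 2 = <eos>, 32 = <mask>
ids = [0, 20, 14, 10, 11, 9, 12, 17, 2]
masked, labels = mask_for_mlm(ids, mask_token_id=32, special_ids={0, 2},
                              mask_prob=0.5, seed=0)
```

For batched tensors the same rule applies per position; padding positions should also receive -100 so they do not contribute to the loss.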

Advanced usage

Sequence classification tasks

from transformers import AutoModelForSequenceClassification, AutoModelForTokenClassification

model = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_small', num_labels=2, trust_remote_code=True)
logits = model(**tokenized).logits
print(logits.shape) # (batch_size, num_labels), (2, 2)

Embedding a dataset

import torch  # needed for embed_dtype below

embedding_dict = model.embed_dataset(
    sequences=[
        'MALWMRLLPLLALLALWGPDPAAA', ... # list of protein sequences
    ],
    tokenizer=model.tokenizer,
    batch_size=2, # adjust for your GPU memory
    max_len=512, # adjust for your needs
    full_embeddings=False, # if True, no pooling is performed
    embed_dtype=torch.float32, # cast to what dtype you want
    pooling_types=['mean', 'cls'], # more than one pooling type will be concatenated together
    num_workers=0, # if you have many cpu cores, we find that num_workers = 4 is fast for large datasets
    sql=False, # if True, embeddings will be stored in SQLite database
    sql_db_path='embeddings.db',
    save=True, # if True, embeddings will be saved as a .pth file
    save_path='embeddings.pth',
)
# embedding_dict is a dictionary mapping sequences to their embeddings as tensors for .pth or numpy arrays for sql
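
The `pooling_types=['mean', 'cls']` argument above states that multiple pooling types are concatenated. A minimal sketch of what that could look like for a single sequence, using plain lists so it runs without PyTorch. The helper names are hypothetical, and whether the library's mean pooling also excludes special tokens is not stated in the source; this sketch only excludes padding.

```python
def mean_pool(hidden, mask):
    """Average the residue vectors at positions where mask == 1 (padding excluded)."""
    kept = [vec for vec, m in zip(hidden, mask) if m]
    return [sum(col) / len(kept) for col in zip(*kept)]


def cls_pool(hidden, mask):
    """Use the vector of the first token as the sequence embedding."""
    return list(hidden[0])


POOLERS = {'mean': mean_pool, 'cls': cls_pool}


def pool(hidden, mask, pooling_types):
    """Apply each pooling type in order and concatenate the results."""
    pooled = []
    for name in pooling_types:
        pooled.extend(POOLERS[name](hidden, mask))
    return pooled


# Toy input: 3 positions (the last one is padding), hidden size 2
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
embedding = pool(hidden, mask, ['mean', 'cls'])
print(embedding)  # [2.0, 3.0, 1.0, 2.0]
```

Under this reading, two pooling types on a hidden size of 960 would yield a 1920-dimensional vector per sequence.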

model.embed_dataset()
Args:
    sequences: List of protein sequences
    batch_size: Batch size for processing
    max_len: Maximum sequence length
    full_embeddings: Whether to return full residue-wise (True) embeddings or pooled (False)
    pooling_types: List of pooling types ('mean', 'cls'); results for more than one type are concatenated
    num_workers: Number of workers for data loading, 0 for the main process
    sql: Whether to store embeddings in SQLite database - will be stored in float32
    sql_db_path: Path to SQLite database
    
Returns:
    Dictionary mapping sequences to embeddings, or None if sql=True

Note:
    - If sql=True, embeddings can only be stored in float32
    - sql is ideal if you need to stream a very large dataset for training in real-time
    - save=True is ideal if you can store the entire embedding dictionary in RAM
    - sql will be used if it is True and save is True or False
    - If your sql database or .pth file is already present, they will be scanned first for already embedded sequences
    - Sequences will be truncated to max_len and sorted by length in descending order for faster processing
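
The last note above says sequences are truncated to `max_len` and sorted longest-first. A minimal sketch of why that ordering helps: when each batch is padded to its longest member, grouping similar lengths reduces the number of padded positions the model has to process. The function names are hypothetical and this is not the library's code; special tokens are ignored for simplicity.

```python
def make_batches(sequences, batch_size, max_len):
    """Truncate to max_len, sort longest-first, and split into batches."""
    ordered = sorted((seq[:max_len] for seq in sequences), key=len, reverse=True)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def padded_positions(batches):
    """Positions computed when every batch is padded to its longest member."""
    return sum(len(batch) * max(len(seq) for seq in batch) for batch in batches)


seqs = ['MK', 'MALWMRLLPL', 'MPR', 'MSEQWENCEA']  # lengths 2, 10, 3, 10
in_order = [seqs[i:i + 2] for i in range(0, len(seqs), 2)]
by_length = make_batches(seqs, batch_size=2, max_len=512)

print(padded_positions(in_order))   # 40
print(padded_positions(by_length))  # 26
```

In this toy case the sorted batches cover 26 positions instead of 40, because the two short sequences no longer share a batch with a long one.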

Fine-tuning with 🤗 peft

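from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_small', num_labels=2, trust_remote_code=True)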
# these modules handle ESM++ and ESM2 attention layers
target_modules = ["layernorm_qkv.1", "out_proj", "query", "key", "value", "dense"]

lora_config = LoraConfig(
    r=8, # choose lora parameters to your liking
    lora_alpha=16,
    lora_dropout=0.01,
    bias="none",
    target_modules=target_modules,
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Unfreeze the classifier head
for param in model.classifier.parameters():
    param.requires_grad = True
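
To get a feel for the `r=8`, `lora_alpha=16` settings above, the arithmetic below estimates how many trainable weights one LoRA adapter adds to a single linear layer. It is a sketch under an assumption: the layer is taken to be a square 960 x 960 projection (960 being the hidden size shown in the basic usage example). The actual shapes of the targeted modules are not stated in the source.

```python
def lora_extra_params(d_in, d_out, r):
    """Weights added by one LoRA adapter: A has shape (r, d_in), B has shape (d_out, r)."""
    return r * (d_in + d_out)


hidden = 960            # hidden size reported in the basic usage example
r, lora_alpha = 8, 16   # values from the LoraConfig above

full = hidden * hidden                         # frozen weights in the assumed 960x960 layer
extra = lora_extra_params(hidden, hidden, r)   # trainable LoRA weights for that layer
scaling = lora_alpha / r                       # default factor applied to the low-rank update

print(full, extra, scaling)  # 921600 15360 2.0
```

Under that assumption each adapted layer trains about 1.7% as many weights as the layer holds, which is why only the classifier head needs to be unfrozen separately.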

Returning attention maps

output = model(**tokenized, output_attentions=True)
att = output.attentions
len(att) # 30, one for each layer, size (batch_size, num_heads, seq_len, seq_len) each

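📚 Documentation

Comparing floating-point precision and implementations

We measured the difference between the last hidden state produced with fp32 weights and the one produced with fp16 or bf16. fp16 turned out to be closer to the fp32 output, so we recommend loading in fp16.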

Average MSE FP32 vs. FP16: 0.00000003
Average MSE FP32 vs. BF16: 0.00000140
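
One likely reason for the gap between the two numbers above is mantissa width: fp16 keeps 10 explicit mantissa bits, bf16 only 7. The sketch below illustrates this using only the standard library, by rounding random values to each format and comparing the mean squared error. It does not reproduce the authors' measurement (no model is involved), and the helper names and the standard-normal test values are assumptions for illustration.

```python
import random
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


def to_bf16(x):
    """Round a Python float to bfloat16 (upper 16 bits of float32) and back.

    Uses round-to-nearest-even; intended for ordinary finite values only.
    """
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to the even value
    bits &= 0xFFFF0000                   # drop the lower 16 bits
    return struct.unpack('<f', struct.pack('<I', bits))[0]


def mse(a, b):
    """Mean squared error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
values = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
mse_fp16 = mse(values, [to_fp16(v) for v in values])
mse_bf16 = mse(values, [to_bf16(v) for v in values])
print(f"fp16: {mse_fp16:.2e}  bf16: {mse_bf16:.2e}")
```

For values of this magnitude the fp16 error comes out well below the bf16 error. bf16 trades that precision for a wider exponent range, which matters more for very large or very small activations.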

We also measured the difference between the outputs of ESM++ and ESMC (both in bfloat16) on 1000 random sequences to confirm agreement with the ESM package.

Average MSE of last hidden state: 7.74e-10

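To load the weights from the ESM package instead of transformers, replace .from_pretrained(...) with .from_pretrained_esm('esmc_300m').

Model probes

As in previous papers, we apply linear probing to a range of protein language models (PLMs) and standard datasets to assess the intrinsic correlation between pooled hidden states and valuable properties. ESMC (and therefore ESM++) performs very well.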

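Inference speed

We examined the throughput of various ESM models on an H100. Adding efficient batching between ESMC and ESM++ substantially improves throughput, and ESM++ is faster than ESMC even at batch size 1. For long sequences, ESM++ small is even faster than ESM2-35M. The gains are largest on a Linux machine with PyTorch > 2.5.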

Citation

If you use this implementation or work, please cite it (along with the ESM C preprint).

@misc {ESMPlusPlus,
	author       = { Hallee, L. and Bichara, D. and Gleghorn, J. P. },
	title        = { ESMPlusPlus },
	year         = 2024,
	url          = { https://huggingface.co/Synthyra/ESMplusplus_small },
	doi          = { 10.57967/hf/3725 },
	publisher    = { Hugging Face }
}