🚀 ESM++
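ESM++ is a faithful implementation of ESMC (license) that supports batching and works with standard Hugging Face tooling without requiring the ESM Python package. The large version corresponds to the 600-million-parameter version of ESMC.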
🚀 Quick start
Note
A bug in Hugging Face weight tying previously caused the ESM++ logits to differ from those of ESMC. This bug has since been fixed.
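✨ Key features
- Faithful implementation of ESMC with batching support and Hugging Face compatibility.
- Sequence-level and token-level classification.
- Weights can be loaded in different floating-point precisions.
- Attention maps can be returned.
- Can be fine-tuned.
- Linear-probe evaluation results are provided.
- High inference throughput.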
💻 Usage examples
Basic usage
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True)
tokenizer = model.tokenizer
sequences = ['MPRTEIN', 'MSEQWENCE']
tokenized = tokenizer(sequences, padding=True, return_tensors='pt')
# tokenized['labels'] = tokenized['input_ids'].clone() # correctly mask input_ids and set unmasked instances of labels to -100 for MLM training
output = model(**tokenized) # get all hidden states with output_hidden_states=True
print(output.logits.shape) # language modeling logits, (batch_size, seq_len, vocab_size), (2, 11, 64)
print(output.last_hidden_state.shape) # last hidden state of the model, (batch_size, seq_len, hidden_size), (2, 11, 1152)
print(output.loss) # language modeling loss if you passed labels
#print(output.hidden_states) # all hidden states if you passed output_hidden_states=True (in tuple)
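The commented-out `labels` line above asks for masked `input_ids` and for labels set to -100 wherever a token was not masked. A minimal plain-Python sketch of that bookkeeping follows. The token ids, the 50% mask rate used in the demo, and the helper name `mask_for_mlm` are illustrative assumptions, not part of ESM++; in practice the ids would come from the tokenizer and the same logic would be applied to tensors.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_for_mlm(input_ids, mask_token_id, special_ids, mask_prob=0.15, seed=None):
    """Return (masked_ids, labels) for one tokenized sequence.

    Masked positions receive `mask_token_id` in the inputs and keep the
    original token in the labels; all other positions get IGNORE_INDEX.
    Special tokens (CLS, EOS, PAD, ...) are never masked.
    """
    rng = random.Random(seed)
    masked_ids, labels = [], []
    for token in input_ids:
        if token not in special_ids and rng.random() < mask_prob:
            masked_ids.append(mask_token_id)
            labels.append(token)
        else:
            masked_ids.append(token)
            labels.append(IGNORE_INDEX)
    return masked_ids, labels


# Placeholder ids only (0 = CLS, 2 = EOS, 1 = PAD, 32 = MASK); not the real vocabulary.
ids = [0, 5, 6, 7, 8, 9, 2, 1, 1]
masked_ids, labels = mask_for_mlm(
    ids, mask_token_id=32, special_ids={0, 1, 2}, mask_prob=0.5, seed=0
)
print(masked_ids)
print(labels)
```

This sketch always substitutes the mask token; BERT-style schemes that sometimes keep or randomly replace the selected token would need an extra branch.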
Advanced usage
Sequence-level and token-level classification
from transformers import AutoModelForSequenceClassification, AutoModelForTokenClassification
model = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_large', num_labels=2, trust_remote_code=True)
logits = model(**tokenized).logits
print(logits.shape) # (batch_size, num_labels), (2, 2)
Loading weights in a different floating-point precision
import torch
model = AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True, torch_dtype=torch.float16) # or torch.bfloat16
Embedding an entire dataset
embedding_dict = model.embed_dataset(
    sequences=[
        'MALWMRLLPLLALLALWGPDPAAA', ... # list of protein sequences
    ],
    tokenizer=model.tokenizer,
    batch_size=2, # adjust for your GPU memory
    max_len=512, # adjust for your needs
    full_embeddings=False, # if True, no pooling is performed
    embed_dtype=torch.float32, # cast to what dtype you want
    pooling_types=['mean', 'cls'], # more than one pooling type will be concatenated together
    num_workers=0, # if you have many cpu cores, we find that num_workers = 4 is fast for large datasets
    sql=False, # if True, embeddings will be stored in SQLite database
    sql_db_path='embeddings.db',
    save=True, # if True, embeddings will be saved as a .pth file
    save_path='embeddings.pth',
)
# embedding_dict is a dictionary mapping sequences to their embeddings as tensors for .pth or numpy arrays for sql
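The `pooling_types` comment says that multiple pooling types are concatenated. The sketch below illustrates that idea on plain Python lists: mean pooling over the non-padding positions, CLS pooling as the first position, then concatenation in the requested order. The helper names and the choice to exclude padding through a 0/1 mask are assumptions made for illustration; the source does not describe how `embed_dataset` pools internally.

```python
def mean_pool(hidden, mask):
    """Average the residue vectors whose mask entry is 1 (padding is ignored)."""
    kept = [vec for vec, keep in zip(hidden, mask) if keep]
    dim = len(hidden[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cls_pool(hidden, mask):
    """Use the vector at the first position (the CLS token)."""
    return list(hidden[0])


POOLERS = {"mean": mean_pool, "cls": cls_pool}


def pool(hidden, mask, pooling_types):
    """Concatenate the requested poolings into one flat vector."""
    pooled = []
    for name in pooling_types:
        pooled.extend(POOLERS[name](hidden, mask))
    return pooled


# Toy "last hidden state" for one sequence: 3 positions, hidden size 2.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]  # the last position is padding
pooled = pool(hidden, mask, ["mean", "cls"])
print(pooled)  # [2.0, 3.0, 1.0, 2.0]
```

With two pooling types the resulting vector is twice the hidden size, which is worth keeping in mind when sizing a downstream classifier.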
model.embed_dataset()
Args:
    sequences: List of protein sequences
    tokenizer: Tokenizer used to encode the sequences (model.tokenizer in the example above)
    batch_size: Batch size for processing
    max_len: Maximum sequence length
    full_embeddings: Whether to return full residue-wise (True) embeddings or pooled (False)
    embed_dtype: dtype the embeddings are cast to
    pooling_types: List of pooling types ('mean', 'cls'); more than one will be concatenated together
    num_workers: Number of workers for data loading, 0 for the main process
    sql: Whether to store embeddings in SQLite database - will be stored in float32
    sql_db_path: Path to SQLite database
    save: Whether to save embeddings as a .pth file
    save_path: Path to the .pth file
Returns:
    Dictionary mapping sequences to embeddings, or None if sql=True
Note:
    - If sql=True, embeddings can only be stored in float32
    - sql is ideal if you need to stream a very large dataset for training in real-time
    - save=True is ideal if you can store the entire embedding dictionary in RAM
    - If sql=True, the SQLite database is used regardless of the value of save
    - If your sql database or .pth file is already present, they will be scanned first for already embedded sequences
    - Sequences will be truncated to max_len and sorted by length in descending order for faster processing
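The last note, truncation to `max_len` followed by sorting by length in descending order, can be sketched in plain Python. Grouping sequences of similar length means each batch is padded to a length close to its members' own, so fewer padded positions are processed. `make_batches` and `padding_tokens` are hypothetical helpers written for this illustration, and truncating raw characters is a simplification of whatever the tokenizer-level truncation actually does.

```python
def make_batches(sequences, max_len, batch_size, sort=True):
    """Truncate to max_len, optionally sort longest-first, and split into batches."""
    truncated = [seq[:max_len] for seq in sequences]
    if sort:
        truncated = sorted(truncated, key=len, reverse=True)
    return [truncated[i:i + batch_size] for i in range(0, len(truncated), batch_size)]


def padding_tokens(batches):
    """Count positions that would be padding if each batch is padded to its longest member."""
    total = 0
    for batch in batches:
        longest = max(len(seq) for seq in batch)
        total += sum(longest - len(seq) for seq in batch)
    return total


seqs = ["MPRTEIN", "MSEQWENCE", "MALWMRLLPLLALLALWGPDPAAA", "MK"]
sorted_batches = make_batches(seqs, max_len=10, batch_size=2)
unsorted_batches = make_batches(seqs, max_len=10, batch_size=2, sort=False)

print(sorted_batches)                    # [['MALWMRLLPL', 'MSEQWENCE'], ['MPRTEIN', 'MK']]
print(padding_tokens(sorted_batches))    # 6
print(padding_tokens(unsorted_batches))  # 10
```

Processing the longest batch first also surfaces out-of-memory errors at the start of a run rather than near the end.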
Fine-tuning with 🤗 peft
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_large', num_labels=2, trust_remote_code=True)
# these modules handle ESM++ and ESM2 attention layers
target_modules = ["layernorm_qkv.1", "out_proj", "query", "key", "value", "dense"]
lora_config = LoraConfig(
    r=8, # choose lora parameters to your liking
    lora_alpha=16,
    lora_dropout=0.01,
    bias="none",
    target_modules=target_modules,
)
# Apply LoRA to the model
model = get_peft_model(model, lora_config)
# Unfreeze the classifier head
for param in model.classifier.parameters():
    param.requires_grad = True
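To get a feel for what `r=8` and `lora_alpha=16` imply, the arithmetic below counts the trainable parameters LoRA adds to a single linear layer. Standard LoRA represents the weight update as the product of a `(d_out, r)` matrix and an `(r, d_in)` matrix and scales it by `lora_alpha / r`. The square 1152 x 1152 layer is an assumption chosen to match the hidden size printed in the basic usage example; the actual shapes of the targeted modules are not stated in this document.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters added by LoRA to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


hidden_size = 1152          # hidden size of the large model, from the shapes shown earlier
r, lora_alpha = 8, 16       # values from the LoraConfig above

full_params = hidden_size * hidden_size            # dense weight of an assumed square projection
added_params = lora_trainable_params(hidden_size, hidden_size, r)
scaling = lora_alpha / r                           # factor applied to the low-rank update

print(full_params)                  # 1327104
print(added_params)                 # 18432
print(added_params / full_params)   # roughly 0.014
print(scaling)                      # 2.0
```

Under these assumptions each adapted layer trains about 1.4% as many parameters as its frozen dense weight, which is why the classifier head is unfrozen separately.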
Returning attention maps
output = model(**tokenized, output_attentions=True)
att = output.attentions
len(att) # 33, one for each layer, size (batch_size, num_heads, seq_len, seq_len) each
📚 Detailed documentation
Loading weights from the ESM package
You can load weights from the ESM package instead of transformers by replacing .from_pretrained(...) with .from_pretrained_esm('esmc_600m').
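Model probing
We use linear probing to evaluate various protein language models (PLMs) on standard datasets, similar to our previous paper, in order to assess the intrinsic correlation between pooled hidden states and valuable properties. ESMC (and therefore ESM++) performs very well.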
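Inference speed
We measured the throughput of various ESM models on an H100. The efficient batching that ESM++ adds over ESMC significantly improves throughput, although ESM++ is also faster than ESMC at a batch size of 1. The small version of ESM++ is even faster than ESM2-35M on long sequences! The gains are most pronounced on Linux machines with PyTorch > 2.5.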
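For readers who want to run a similar comparison on their own hardware, a throughput measurement can be sketched as below. `measure_throughput` and `dummy_embed` are hypothetical names; `dummy_embed` stands in for a real forward pass and does no meaningful work, so the printed numbers say nothing about ESM++ itself. This is not the benchmark code behind the results described above.

```python
import time


def measure_throughput(embed_fn, sequences, batch_size):
    """Return sequences processed per second when calling embed_fn batch by batch."""
    start = time.perf_counter()
    for i in range(0, len(sequences), batch_size):
        embed_fn(sequences[i:i + batch_size])
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return len(sequences) / max(elapsed, 1e-12)


def dummy_embed(batch):
    """Hypothetical stand-in for tokenizing a batch and running the model on it."""
    return [len(seq) for seq in batch]


sequences = ["MPRTEIN", "MSEQWENCE"] * 50
for batch_size in (1, 8, 32):
    rate = measure_throughput(dummy_embed, sequences, batch_size)
    print(f"batch_size={batch_size}: {rate:.0f} sequences/s")
```

When timing a GPU model, the device should be synchronized (for example with `torch.cuda.synchronize()`) before each clock reading, and a few warm-up batches should be discarded, otherwise asynchronous kernel launches distort the numbers.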
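🔧 Technical details
Comparison of floating-point precision and implementations
We measured the difference in the last hidden state between fp32 weights and fp16 or bf16 weights. We found that fp16 is closer to the fp32 output, so we recommend loading in fp16.
Note that the ESM package also loads ESMC in fp32 but casts to bf16 by default, which has its own advantages and trade-offs for inference and training, so you can load in whichever half precision suits your needs.
Average mean squared error (MSE) for FP16: 0.00000003
Average mean squared error (MSE) for BF16: 0.00000122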
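One contributing factor to this gap is format precision: fp16 stores 10 explicit mantissa bits while bf16 stores 7, so for values of moderate magnitude fp16 rounds more finely. The stdlib sketch below emulates both roundings on random float32 values in [-1, 1] and compares the resulting MSE. It only measures rounding of individual numbers, not error accumulated through a model, and it ignores the much narrower range of fp16 (largest finite value 65504), so it should not be read as reproducing the figures above.

```python
import random
import struct


def to_fp32(x):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_fp16(x):
    """Round to IEEE 754 half precision using struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Emulate bfloat16: keep the top 16 bits of the float32 pattern, round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias for the 16 discarded bits
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


rng = random.Random(0)
reference = [to_fp32(rng.uniform(-1.0, 1.0)) for _ in range(10_000)]

mse_fp16 = mse(reference, [to_fp16(v) for v in reference])
mse_bf16 = mse(reference, [to_bf16(v) for v in reference])

print(f"fp16 rounding MSE: {mse_fp16:.3e}")
print(f"bf16 rounding MSE: {mse_bf16:.3e}")
```

In this toy setting the fp16 error comes out smaller than the bf16 error, in the same direction as the measurements reported here.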
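We also measured the difference between the outputs of ESM++ and ESMC (both in bfloat16) on 1000 random sequences to ensure compatibility with the ESM package.
Average mean squared error (MSE) of the last hidden state: 2.46e-09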
📄 License
Please refer to the ESMC license.
Citation
If you use this implementation or related work, please cite it (along with the ESMC preprint).
@misc {ESMPlusPlus,
author = { Hallee, L. and Bichara, D. and Gleghorn, J. P. },
title = { ESMPlusPlus },
year = 2024,
url = { https://huggingface.co/Synthyra/ESMplusplus_small },
doi = { 10.57967/hf/3726 },
publisher = { Hugging Face }
}
Fine-tuning example
For a more detailed fine-tuning example, see our example script here.











