🚀 ESM++
ESM++ is a faithful implementation of ESMC (license) that supports batching and works with the standard Huggingface ecosystem without requiring the ESM Python package. The small version corresponds to the 300-million-parameter version of ESMC.
🚀 Quick Start
There was previously a bug with Huggingface weight tying that caused the logits of ESM++ to differ from those of ESMC. That bug has now been fixed.
✨ Key Features
- Faithful reimplementation of ESMC with support for batching.
- Compatible with standard Huggingface tooling, with no dependency on the ESM Python package.
- Supports sequence-level and token-level classification tasks.
- Offers loading in different floating-point precisions.
- Can return attention maps.
- Faster inference than ESMC and other comparable models.
📦 Installation
The source documentation does not list installation steps, so none are shown here.
💻 Usage Examples
Basic Usage
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_small', trust_remote_code=True)
tokenizer = model.tokenizer
sequences = ['MPRTEIN', 'MSEQWENCE']
tokenized = tokenizer(sequences, padding=True, return_tensors='pt')
# tokenized['labels'] = tokenized['input_ids'].clone() # correctly mask input_ids and set unmasked instances of labels to -100 for MLM training
output = model(**tokenized) # get all hidden states with output_hidden_states=True
print(output.logits.shape) # language modeling logits, (batch_size, seq_len, vocab_size), (2, 11, 64)
print(output.last_hidden_state.shape) # last hidden state of the model, (batch_size, seq_len, hidden_size), (2, 11, 960)
print(output.loss) # language modeling loss if you passed labels
#print(output.hidden_states) # all hidden states if you passed output_hidden_states=True (in tuple)
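The commented-out labels line above refers to MLM-style training. Below is a minimal sketch of that masking step; the 15% masking rate is an illustrative choice, and it assumes the tokenizer exposes mask_token_id (as Huggingface tokenizers generally do):
import torch

labels = tokenized['input_ids'].clone()
mask_prob = torch.full(labels.shape, 0.15)                    # illustrative 15% masking rate (assumption)
mask_prob[tokenized['attention_mask'] == 0] = 0.0             # never mask padding positions
masked = torch.bernoulli(mask_prob).bool()
labels[~masked] = -100                                        # compute the loss only on masked positions
masked_inputs = tokenized['input_ids'].masked_fill(masked, tokenizer.mask_token_id)
out = model(input_ids=masked_inputs, attention_mask=tokenized['attention_mask'], labels=labels)
print(out.loss)                                               # MLM loss over the masked tokens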
Advanced Usage
Sequence-level and token-level classification
from transformers import AutoModelForSequenceClassification, AutoModelForTokenClassification
model = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_small', num_labels=2, trust_remote_code=True)
logits = model(**tokenized).logits
print(logits.shape) # (batch_size, num_labels), (2, 2)
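Token-level classification works the same way via AutoModelForTokenClassification (imported above); a brief sketch, with num_labels=2 as an arbitrary choice:
token_model = AutoModelForTokenClassification.from_pretrained('Synthyra/ESMplusplus_small', num_labels=2, trust_remote_code=True)
token_logits = token_model(**tokenized).logits
print(token_logits.shape)  # (batch_size, seq_len, num_labels), here (2, 11, 2)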
Loading weights in different floating-point precisions
import torch
model = AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_small', trust_remote_code=True, torch_dtype=torch.float16) # or torch.bfloat16
Embedding an entire dataset
embedding_dict = model.embed_dataset(
sequences=[
'MALWMRLLPLLALLALWGPDPAAA', ... # list of protein sequences
],
tokenizer=model.tokenizer,
batch_size=2, # adjust for your GPU memory
max_len=512, # adjust for your needs
full_embeddings=False, # if True, no pooling is performed
embed_dtype=torch.float32, # cast to what dtype you want
pooling_types=['mean', 'cls'], # more than one pooling type will be concatenated together
num_workers=0, # if you have many cpu cores, we find that num_workers = 4 is fast for large datasets
sql=False, # if True, embeddings will be stored in SQLite database
sql_db_path='embeddings.db',
save=True, # if True, embeddings will be saved as a .pth file
save_path='embeddings.pth',
)
# embedding_dict is a dictionary mapping sequences to their embeddings as tensors for .pth or numpy arrays for sql
model.embed_dataset()
Args:
sequences: List of protein sequences
batch_size: Batch size for processing
max_len: Maximum sequence length
full_embeddings: Whether to return full residue-wise (True) embeddings or pooled (False)
pooling_types: Types of pooling to apply ('mean' and/or 'cls'); multiple types are concatenated
num_workers: Number of workers for data loading, 0 for the main process
sql: Whether to store embeddings in SQLite database - will be stored in float32
sql_db_path: Path to SQLite database
Returns:
Dictionary mapping sequences to embeddings, or None if sql=True
Note:
- If sql=True, embeddings can only be stored in float32
- sql is ideal if you need to stream a very large dataset for training in real-time
- save=True is ideal if you can store the entire embedding dictionary in RAM
- sql will be used if it is True and save is True or False
- If your sql database or .pth file is already present, they will be scanned first for already embedded sequences
- Sequences will be truncated to max_len and sorted by length in descending order for faster processing
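A quick sketch of reading the saved results back in, assuming the save=True / save_path='embeddings.pth' settings used above; each key is the original sequence string:
import torch

embedding_dict = torch.load('embeddings.pth')
emb = embedding_dict['MALWMRLLPLLALLALWGPDPAAA']
print(emb.shape)  # pooled embedding; with pooling_types=['mean', 'cls'] the two vectors are concatenated (e.g. 2 * 960 = 1920 dims for the small model)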
Fine-tuning with 🤗 peft
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_small', num_labels=2, trust_remote_code=True)
# these modules handle ESM++ and ESM2 attention layers
target_modules = ["layernorm_qkv.1", "out_proj", "query", "key", "value", "dense"]
lora_config = LoraConfig(
r=8, # choose lora parameters to your liking
lora_alpha=16,
lora_dropout=0.01,
bias="none",
target_modules=target_modules,
)
# Apply LoRA to the model
model = get_peft_model(model, lora_config)
# Unfreeze the classifier head
for param in model.classifier.parameters():
param.requires_grad = True
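An optional sanity check after wrapping the model (print_trainable_parameters is provided by peft's PeftModel wrapper):
model.print_trainable_parameters()  # only the LoRA matrices and the unfrozen classifier head should be trainable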
For a more comprehensive fine-tuning example, take a look at our example script here.
Returning attention maps
output = model(**tokenized, output_attentions=True)
att = output.attentions
len(att) # 30, one for each layer, size (batch_size, num_heads, seq_len, seq_len) each
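If you want a single residue-by-residue matrix per sequence, one simple (purely illustrative) reduction is to average the returned maps over layers and heads:
import torch

stacked = torch.stack(att)            # (num_layers, batch_size, num_heads, seq_len, seq_len)
avg_map = stacked.mean(dim=(0, 2))    # average over layers and heads -> (batch_size, seq_len, seq_len)
print(avg_map.shape)                  # (2, 11, 11) for the tokenized batch above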
📚 Detailed Documentation
Comparison of floating-point precisions and implementations
We measured the difference in the last hidden states between the fp32 weights and fp16 or bf16 casts. fp16 is closer to the fp32 outputs, so we recommend loading in fp16. Note that the ESM package also loads ESMC in fp32 but casts to bf16 by default; each half precision has its own trade-offs for inference and training, so load whichever suits your needs. A rough sketch of how such a comparison can be run follows the numbers below.
- Average MSE between FP32 and FP16: 0.00000003
- Average MSE between FP32 and BF16: 0.00000140
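The exact evaluation protocol is not described here, so the following is only a minimal sketch of an fp32-vs-fp16 comparison (device handling and the two example sequences are arbitrary choices; half precision is best exercised on a GPU):
import torch
from transformers import AutoModelForMaskedLM

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_fp32 = AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_small', trust_remote_code=True).to(device)
model_fp16 = AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_small', trust_remote_code=True, torch_dtype=torch.float16).to(device)
batch = model_fp32.tokenizer(['MPRTEIN', 'MSEQWENCE'], padding=True, return_tensors='pt').to(device)
with torch.no_grad():
    h32 = model_fp32(**batch).last_hidden_state
    h16 = model_fp16(**batch).last_hidden_state.float()
print(torch.mean((h32 - h16) ** 2).item())  # mean squared error of the last hidden states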
We also measured the difference in outputs between ESM++ and ESMC (both in bfloat16) on 1000 random sequences to ensure compatibility with the ESM package.
- Average MSE of the last hidden states: 7.74e-10
You can load the weights from the ESM package instead of transformers by replacing .from_pretrained(...) with .from_pretrained_esm('esmc_300m').
Model probes
We use linear probing on various PLMs and standard datasets, similar to our previous paper, to assess the intrinsic correlation between pooled hidden states and valuable properties. ESMC (and thus ESM++) performs very well.
The plot below shows performance normalized between the negative control (random vector embeddings) and the best performer. Classification scores are averaged between MCC and F1 (or F1max for multi-label tasks); regression scores are averaged between Spearman rho and R2. A toy sketch of such a probe is given after this paragraph.
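The sketch below only illustrates the idea of a linear probe on mean-pooled embeddings; train_seqs and train_labels are hypothetical placeholders for your own data, and the benchmark's actual datasets, splits, and metrics are not reproduced:
import numpy as np
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForMaskedLM

base = AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_small', trust_remote_code=True)
emb_dict = base.embed_dataset(
    sequences=train_seqs,              # hypothetical list of training sequences
    tokenizer=base.tokenizer,
    batch_size=2,
    full_embeddings=False,
    pooling_types=['mean'],
    sql=False,
    save=False,
)
X = np.stack([emb_dict[s].numpy() for s in train_seqs])
probe = LogisticRegression(max_iter=1000).fit(X, train_labels)   # train_labels: hypothetical per-sequence class labels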
Inference speed
We look at the throughput of various ESM models on an H100. Adding efficient batching between ESMC and ESM++ improves throughput significantly, and ESM++ is faster than ESMC even at a batch size of one. ESM++ small is even faster than ESM2-35M on long sequences! The gains are most pronounced with PyTorch > 2.5 on Linux machines. A rough sketch of a throughput measurement follows.
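Our benchmark setup is not reproduced here; the sketch below only shows one way to take a rough throughput measurement (batch size, sequence length, and iteration counts are arbitrary):
import time
import torch
from transformers import AutoModelForMaskedLM

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_small', trust_remote_code=True).to(device).eval()
batch = model.tokenizer(['M' * 512] * 8, return_tensors='pt').to(device)   # 8 sequences of 512 residues (arbitrary)
with torch.no_grad():
    for _ in range(3):                       # warm-up iterations
        model(**batch)
    if device == 'cuda':
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(10):
        model(**batch)
    if device == 'cuda':
        torch.cuda.synchronize()
elapsed = time.time() - start
print(f'{8 * 512 * 10 / elapsed:.0f} tokens/sec (rough estimate)')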
🔧 Technical Details
The source documentation does not provide implementation details, so none are shown here.
📄 Citation
If you use this implementation or any of the associated work, please cite it (as well as the ESMC preprint).
@misc{ESMPlusPlus,
  author = {Hallee, L. and Bichara, D. and Gleghorn, J. P.},
  title = {ESMPlusPlus},
  year = {2024},
  url = {https://huggingface.co/Synthyra/ESMplusplus_small},
  doi = {10.57967/hf/3725},
  publisher = {Hugging Face}
}