🚀 SegmentNT
SegmentNT is a segmentation model that uses the Nucleotide Transformer (NT) DNA foundation model to predict the location of several types of genomic elements in a sequence at single-nucleotide resolution. It was trained on 14 different classes of human genomic elements in input sequences up to 30 kb long. These include gene elements (protein-coding genes, lncRNAs, 5'UTRs, 3'UTRs, exons, introns, splice acceptor and donor sites) and regulatory elements (polyA signals, tissue-invariant and tissue-specific promoters and enhancers, and CTCF-bound sites).
Developer: InstaDeep
🚀 Quick Start
Model Sources
- Repository: Nucleotide Transformer
- Paper: Segmenting the genome at single-nucleotide resolution with DNA foundation models
How to Use
Until its next release, the `transformers` library needs to be installed from source with the following command in order to use the model:
```
pip install --upgrade git+https://github.com/huggingface/transformers.git
```
A small snippet of code is given below to retrieve both logits and embeddings from a dummy DNA sequence.
⚠️ The maximum sequence length defaults to the training length of 30,000 nucleotides, i.e. 5,001 tokens (counting the CLS token). However, the multi-species SegmentNT model has been shown to generalize to sequences of up to 50,000 bp. To run inference on sequences between 30 kbp and 50 kbp, make sure to change the `rescaling_factor` of the rotary embedding layers in the ESM model to `num_dna_tokens_inference / max_num_tokens_nt`, where `num_dna_tokens_inference` is the number of tokens at inference (e.g. 6669 for a sequence of 40,008 base pairs) and `max_num_tokens_nt` is the maximum number of tokens the backbone Nucleotide Transformer was trained on, i.e. `2048`.
The notebook `./inference_segment_nt.ipynb` (click the icon to run it in Google Colab) shows how to run inference both on sequence lengths that require changing the rescaling factor and on lengths that do not, and reproduces Figures 1 and 3 of the SegmentNT paper.
```python
# Load model and tokenizer
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)
model = AutoModel.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)

# Choose the length to which the input sequences are padded. By default, the
# model max length is chosen, but feel free to decrease it as the time taken to
# obtain the embeddings increases significantly with it.
# The number of DNA tokens (excluding the prepended CLS token) needs to be
# divisible by 2 to the power of the number of downsampling blocks, i.e. 4.
max_length = 12 + 1
assert (max_length - 1) % 4 == 0, (
    "The number of DNA tokens (excluding the prepended CLS token) needs to be "
    "divisible by 2 to the power of the number of downsampling blocks, i.e. 4.")

# Create a dummy DNA sequence and tokenize it
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

# Infer
attention_mask = tokens != tokenizer.pad_token_id
outs = model(
    tokens,
    attention_mask=attention_mask,
    output_hidden_states=True
)

# Obtain the logits over the genomic features
logits = outs.logits.detach()
# Transform them into probabilities
probabilities = torch.nn.functional.softmax(logits, dim=-1)
print(f"Probabilities shape: {probabilities.shape}")

# Get the probabilities associated with introns
idx_intron = model.config.features.index("intron")
probabilities_intron = probabilities[:, :, idx_intron]
print(f"Intron probabilities shape: {probabilities_intron.shape}")
```
✨ Key Features
The SegmentNT model builds on the Nucleotide Transformer DNA foundation model to predict the location of genomic elements at single-nucleotide resolution, covering many types of gene and regulatory elements.
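The Quick Start snippet above reads the intron index from `model.config.features`; the same attribute can be used to list all 14 predicted element classes in the order the model emits them:
```python
# List the genomic feature classes the model predicts, in logit order.
# Uses the same model.config.features attribute as the Quick Start snippet.
from transformers import AutoModel

model = AutoModel.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)
for i, feature in enumerate(model.config.features):
    print(i, feature)
```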
💻 Usage Examples
Basic Usage
Basic usage is identical to the Quick Start example above: load the tokenizer and model, tokenize the input sequences, run inference, and read per-feature probabilities from the logits.
Advanced Usage
For sequences between 30 kbp and 50 kbp, the `rescaling_factor` of the rotary embedding layers must be changed, as explained in the Quick Start section above.
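A minimal sketch of how this could look, assuming the custom model code accepts a `rescaling_factor` keyword through `from_pretrained` (an assumption; see `./inference_segment_nt.ipynb` for the authoritative recipe):
```python
# Sketch only: the rescaling_factor kwarg is an assumption about the custom
# model code; the values follow the formula given in the Quick Start section.
from transformers import AutoModel

num_dna_tokens_inference = 6669  # tokens at inference for a 40,008 bp sequence
max_num_tokens_nt = 2048         # max tokens the NT backbone was trained on
rescaling_factor = num_dna_tokens_inference / max_num_tokens_nt

model = AutoModel.from_pretrained(
    "InstaDeepAI/segment_nt",
    trust_remote_code=True,
    rescaling_factor=rescaling_factor,
)
```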
📚 Documentation
Training Data
The SegmentNT model was trained on all human chromosomes except chromosomes 20 and 21, kept as a test set, and chromosome 22, used as a validation set. During training, sequences were randomly sampled from the genome together with their annotations. For the validation and test sets, however, the sequences are fixed by using a sliding window of length 30,000 over chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.
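For illustration, fixed evaluation windows like the ones described above can be produced as follows (the stride is not stated in this card and is an assumption here):
```python
# Illustrative sketch: fixed 30 kb evaluation windows over a chromosome.
# A non-overlapping stride is assumed; the card does not specify one.
def sliding_windows(sequence: str, window: int = 30_000, stride: int = 30_000):
    for start in range(0, len(sequence) - window + 1, stride):
        yield start, sequence[start:start + window]

# Example over a dummy 100 kb "chromosome"
chromosome = "ACGT" * 25_000
for start, seq in sliding_windows(chromosome):
    print(start, len(seq))
```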
Training Procedure
Preprocessing
DNA sequences are tokenized with the Nucleotide Transformer tokenizer, which splits a sequence into 6-mer tokens as described in the Tokenization section of the associated repository. This tokenizer has a vocabulary size of 4105. Inputs to the model take the form:
```
<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
```
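This can be checked directly with the tokenizer (the example sequence is illustrative; `encode` and `convert_ids_to_tokens` are standard `transformers` tokenizer methods, assumed to behave normally for this custom tokenizer):
```python
# Quick check of the 6-mer tokenization described above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)
ids = tokenizer.encode("ACGTGTACGTGCACGGACGACTAGTCAGCA")
print(tokenizer.convert_ids_to_tokens(ids))
# Expected to resemble: ['<CLS>', 'ACGTGT', 'ACGTGC', 'ACGGAC', 'GACTAG', 'TCAGCA']
```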
Training
The model was trained on a DGX H100 node with 8 GPUs, on a total of 23B tokens over 3 days. It was trained successively on sequences of 3 kb, 10 kb, 20 kb and 30 kb, each time with an effective batch size of 256 sequences.
Architecture
The model consists of the nucleotide-transformer-v2-500m-multi-species encoder, from which we removed the language model head and replaced it with a 1-dimensional U-Net segmentation head [4] made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these blocks consists of 2 convolutional layers with 1,024 and 2,048 kernels respectively. This additional segmentation head accounts for 53 million parameters, bringing the total number of parameters to 562M.
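A minimal PyTorch sketch of what such a head could look like is given below. The down/up block counts and kernel counts come from the description above; the kernel size, activations, skip wiring, and output layout are assumptions, not the exact SegmentNT implementation.
```python
import torch
import torch.nn as nn

class ConvBlock1D(nn.Module):
    # Two 1D conv layers per block, as described above; kernel size and
    # activation are assumptions.
    def __init__(self, in_ch, mid_ch, out_ch, kernel_size=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, mid_ch, kernel_size, padding="same"), nn.GELU(),
            nn.Conv1d(mid_ch, out_ch, kernel_size, padding="same"), nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)

class UNetHead1D(nn.Module):
    # 2 downsampling and 2 upsampling blocks over the token axis; the input
    # length must be divisible by 2**2 = 4, matching the assert in the
    # Quick Start snippet.
    def __init__(self, embed_dim=1024, num_features=14):
        super().__init__()
        self.pool = nn.MaxPool1d(2)
        self.down1 = ConvBlock1D(embed_dim, 1024, 1024)
        self.down2 = ConvBlock1D(1024, 2048, 2048)
        self.up2 = nn.ConvTranspose1d(2048, 2048, 2, stride=2)
        self.dec2 = ConvBlock1D(2048 + 2048, 2048, 1024)
        self.up1 = nn.ConvTranspose1d(1024, 1024, 2, stride=2)
        self.dec1 = ConvBlock1D(1024 + 1024, 1024, 1024)
        self.out = nn.Conv1d(1024, num_features * 2, 1)  # assumed output layout

    def forward(self, x):                                  # x: (B, embed_dim, L)
        e1 = self.down1(x)                                 # (B, 1024, L)
        e2 = self.down2(self.pool(e1))                     # (B, 2048, L/2)
        b = self.pool(e2)                                  # (B, 2048, L/4)
        d2 = self.dec2(torch.cat([self.up2(b), e2], 1))    # (B, 1024, L/2)
        d1 = self.dec1(torch.cat([self.up1(d2), e1], 1))   # (B, 1024, L)
        return self.out(d1)                                # (B, num_features*2, L)

head = UNetHead1D()
print(head(torch.randn(1, 1024, 8)).shape)  # torch.Size([1, 28, 8])
```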
BibTeX Citation
```bibtex
@article{de2024segmentnt,
  title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
  author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
  journal={bioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
```
📄 License
This project is licensed under CC-BY-NC-SA 4.0.







