🚀 SegmentNT
SegmentNT is a segmentation model that leverages the Nucleotide Transformer (NT) DNA foundation model to predict the location of several types of genomic elements in a sequence at single-nucleotide resolution. It was trained on 14 different classes of human genomic elements in input sequences up to 30kb. These include gene elements (protein-coding genes, lncRNAs, 5'UTRs, 3'UTRs, exons, introns, splice acceptor and donor sites) and regulatory elements (polyA signals, tissue-invariant and tissue-specific promoters and enhancers, and CTCF-bound sites).
Developed by: InstaDeep
🚀 Quick Start
Model Sources
- Repository: Nucleotide Transformer
- Paper: Segmenting the genome at single-nucleotide resolution with DNA foundation models
How to Use
Until its next release, the transformers library needs to be installed from source to use this model, with the following command:

```bash
pip install --upgrade git+https://github.com/huggingface/transformers.git
```

A small snippet of code is given below to retrieve both logits and embeddings from a dummy DNA sequence.
⚠️ By default, the maximum sequence length is set to the training length of 30,000 nucleotides, i.e. 5001 tokens (including the CLS token). However, the SegmentNT multi-species model has been shown to generalize to sequences of up to 50,000 bp. If you need to run inference on sequences between 30kbp and 50kbp, make sure to change the `rescaling_factor` of the rotary embedding layer in the esm model to `num_dna_tokens_inference / max_num_tokens_nt`, where `num_dna_tokens_inference` is the number of tokens at inference time (e.g. 6669 for a sequence of 40,008 base pairs) and `max_num_tokens_nt` is the maximum number of tokens the underlying Nucleotide Transformer was trained on, i.e. `2048`. A sketch of this change is given under Advanced Usage below.

The notebook `./inference_segment_nt.ipynb` (click the icon to run it in Google Colab) shows how to run inference both on sequence lengths that require changing the rescaling factor and on lengths that do not; it can be run to reproduce Figures 1 and 3 of the SegmentNT paper.
```python
# Load model and tokenizer
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)
model = AutoModel.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)

# Choose the length to which the input sequences are padded. By default, the
# model max length is chosen, but feel free to decrease it as the time taken to
# obtain the embeddings increases significantly with it.
# The number of DNA tokens (excluding the prepended CLS token) needs to be
# divisible by 2 to the power of the number of downsampling blocks, i.e. 4.
max_length = 12 + 1
assert (max_length - 1) % 4 == 0, (
    "The number of DNA tokens (excluding the prepended CLS token) needs to be "
    "divisible by 2 to the power of the number of downsampling blocks, i.e. 4.")

# Create a dummy DNA sequence and tokenize it
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

# Infer
attention_mask = tokens != tokenizer.pad_token_id
outs = model(
    tokens,
    attention_mask=attention_mask,
    output_hidden_states=True
)

# Obtain the logits over the genomic features
logits = outs.logits.detach()
# Transform them into probabilities
probabilities = torch.nn.functional.softmax(logits, dim=-1)
print(f"Probabilities shape: {probabilities.shape}")

# Get the probabilities associated with introns
idx_intron = model.config.features.index("intron")
probabilities_intron = probabilities[:, :, idx_intron]
print(f"Intron probabilities shape: {probabilities_intron.shape}")
```
✨ Key Features
Building on the Nucleotide Transformer DNA foundation model, SegmentNT predicts the location of genomic elements at single-nucleotide resolution and handles many types of gene and regulatory elements.
💻 Usage Examples
Basic Usage
The basic usage snippet is the one shown in the Quick Start section above.
Advanced Usage
For sequences between 30kbp and 50kbp, the `rescaling_factor` of the rotary embedding layer has to be changed, as explained in the Quick Start section.
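Below is a minimal sketch of that change, assuming the remote-code model config exposes a `rescaling_factor` field read by the rotary embeddings; the exact mechanism may differ, and `inference_segment_nt.ipynb` is the authoritative reference. The token counts are the ones quoted in the Quick Start note.

```python
from transformers import AutoConfig, AutoModel

# Values from the Quick Start note: 6669 tokens at inference for a 40,008 bp
# sequence, 2048 max tokens for the underlying Nucleotide Transformer.
num_dna_tokens_inference = 6669
max_num_tokens_nt = 2048
rescaling_factor = num_dna_tokens_inference / max_num_tokens_nt  # ~3.26

# Assumption: `rescaling_factor` is a config attribute picked up by the
# rotary embedding layer; check the released notebook for the exact name.
config = AutoConfig.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)
config.rescaling_factor = rescaling_factor
model = AutoModel.from_pretrained(
    "InstaDeepAI/segment_nt", config=config, trust_remote_code=True
)
```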
📚 Documentation
Training Data
The SegmentNT model was trained on all human chromosomes except chromosomes 20 and 21, kept as a test set, and chromosome 22, used as a validation set. During training, sequences were randomly sampled in the genome with their associated annotations. However, the sequences in the validation and test sets were fixed by using a sliding window of length 30,000 over chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.
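As an illustration of how such fixed evaluation windows can be produced, here is a minimal sketch that tiles a chromosome sequence into consecutive 30,000 bp windows; the stride and the handling of a trailing partial window are assumptions, not details taken from the paper.

```python
# Tile a chromosome into fixed 30,000 bp windows, as used for the
# validation/test sets. Stride and edge handling here are illustrative.
def sliding_windows(sequence: str, window: int = 30_000, stride: int = 30_000):
    for start in range(0, len(sequence) - window + 1, stride):
        yield start, sequence[start:start + window]

# Toy example: an 80,000 bp sequence yields two full 30,000 bp windows.
toy_chromosome = "ACGT" * 20_000
print(sum(1 for _ in sliding_windows(toy_chromosome)))  # 2
```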
Training Procedure
Preprocessing
The DNA sequences are tokenized using the Nucleotide Transformer tokenizer, which tokenizes sequences as 6-mer tokens, as described in the Tokenization section of the associated repository. This tokenizer has a vocabulary size of 4105. The inputs of the model are then of the form:
<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
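A quick way to see this 6-mer tokenization in action is the sketch below; it assumes only the standard Hugging Face tokenizer API, with special-token handling following whatever the checkpoint ships.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)

# An 18 bp sequence splits into three 6-mers, with a CLS token prepended.
ids = tokenizer("ACGTGTACGTGCACGGAC")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))  # e.g. ['<CLS>', 'ACGTGT', 'ACGTGC', 'ACGGAC']
print(tokenizer.vocab_size)  # 4105
```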
Training
The model was trained on a DGX H100 node with 8 GPUs on a total of 23B tokens for 3 days. It was trained sequentially on sequences of 3kb, 10kb, 20kb and 30kb, each time with an effective batch size of 256 sequences.
Architecture
The model is composed of the nucleotide-transformer-v2-500m-multi-species encoder, from which we removed the language model head and replaced it with a 1-dimensional U-Net segmentation head [4] made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these blocks is made of 2 convolutional layers with 1024 and 2048 kernels respectively. This additional segmentation head accounts for 53 million parameters, bringing the total number of parameters to 562M.
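To make the shape of this head concrete, here is a hedged sketch of a 1D U-Net segmentation head with two downsampling and two upsampling blocks of two convolutional layers each, matching the 1024/2048 channel widths above. Kernel size, activation, pooling/upsampling choices, the `embed_dim` value, and the final per-feature projection are illustrative assumptions; the released code is the reference implementation. Note how the two pooling steps explain the requirement from the Quick Start snippet that the number of DNA tokens be divisible by 2^2 = 4.

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two Conv1d layers per block, as described above."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1), nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)

class UNetSegmentationHead(nn.Module):
    def __init__(self, embed_dim, num_features=14, num_classes=2):
        super().__init__()
        self.enc1 = DoubleConv(embed_dim, 1024)    # downsampling block 1
        self.enc2 = DoubleConv(1024, 2048)         # downsampling block 2
        self.pool = nn.MaxPool1d(2)
        self.up = nn.Upsample(scale_factor=2, mode="linear", align_corners=False)
        self.dec2 = DoubleConv(2048 + 2048, 2048)  # upsampling block 2 (skip from enc2)
        self.dec1 = DoubleConv(2048 + 1024, 1024)  # upsampling block 1 (skip from enc1)
        self.out = nn.Conv1d(1024, num_features * num_classes, kernel_size=1)

    def forward(self, x):              # x: (batch, embed_dim, tokens), tokens % 4 == 0
        s1 = self.enc1(x)              # (B, 1024, T)
        s2 = self.enc2(self.pool(s1))  # (B, 2048, T/2)
        h = self.pool(s2)              # (B, 2048, T/4)
        h = self.dec2(torch.cat([self.up(h), s2], dim=1))  # (B, 2048, T/2)
        h = self.dec1(torch.cat([self.up(h), s1], dim=1))  # (B, 1024, T)
        return self.out(h)             # (B, num_features * num_classes, T)

head = UNetSegmentationHead(embed_dim=1024)  # embed_dim value is illustrative
print(head(torch.randn(1, 1024, 8)).shape)   # torch.Size([1, 28, 8])
```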
BibTeX Citation

```bibtex
@article{de2024segmentnt,
  title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
  author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
  journal={bioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
```
📄 License
This project is licensed under the CC-BY-NC-SA 4.0 license.