SegmentNT-multi-species: an open-source genome segmentation model that predicts the locations of multiple genomic elements

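Developed by InstaDeepAI

SegmentNT-multi-species is a segmentation model based on the Nucleotide Transformer that predicts the locations of several types of genomic elements at single-nucleotide resolution.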

Release date: 3/5/2024

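Model overview

The model was obtained by fine-tuning SegmentNT on a dataset of genomes from human and five selected species (mouse, chicken, fly, zebrafish and worm). It predicts the locations of 7 main gene elements.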

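Model features

Multi-species support

Supports genome analysis for human and five other species (mouse, chicken, fly, zebrafish and worm).

High-resolution segmentation

Predicts the locations of genomic elements at single-nucleotide resolution.

Efficient training

Fine-tuned for 3 days on a DGX H100 node with 8 GPUs, processing 8 billion tokens in total.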

Model capabilities

Genomic element prediction

DNA sequence analysis

Multi-species genome segmentation

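Use cases

Genomics research

Gene element localization

Predicts the positions of gene elements such as exons and introns in a DNA sequence.

Covers the 7 main gene elements the model was fine-tuned on.

Cross-species comparison

Analyzes similarities and differences in genomic elements across species.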

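🚀 Multi-species nucleotide sequence segmentation model (segment-nt-multi-species)

SegmentNT-multi-species is a segmentation model that leverages the Nucleotide Transformer (NT) DNA foundation model to predict the locations of several types of genomic elements in a sequence at single-nucleotide resolution. It is the result of fine-tuning the SegmentNT model on a dataset containing the human genome and the genomes of 5 selected species: mouse, chicken, fly, zebrafish and worm.

For fine-tuning on the multi-species genomes, we curated a dataset covering a subset of the annotations used to train SegmentNT, mainly because only this subset is available for these species. The annotations therefore cover the 7 main gene elements available from Ensembl: protein-coding gene, 5'UTR, 3'UTR, intron, exon, splice acceptor site and splice donor site.

Developed by: InstaDeep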

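🚀 Quick start

Model sources

Repository: Nucleotide Transformer
Paper: Segmenting the genome at single-nucleotide resolution with DNA foundation models

How to use

Until the next release of the transformers library, using the model requires installing the library from source with the following command: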

pip install --upgrade git+https://github.com/huggingface/transformers.git

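Below is a small code snippet that retrieves logits and embeddings from dummy DNA sequences.

⚠️ Important

By default, the maximum sequence length is set to the training length of 30,000 nucleotides, i.e. 5001 tokens (including the CLS token). SegmentNT has, however, been shown to generalize to sequences of up to 50,000 bp. To run inference on sequences between 30 kbp and 50 kbp, set the rescaling_factor parameter in the config to num_dna_tokens_inference / max_num_tokens_nt, where num_dna_tokens_inference is the number of tokens at inference (for example, 6669 for a 40,008 bp sequence) and max_num_tokens_nt is the maximum number of tokens the backbone Nucleotide Transformer was trained on, i.e. 2048.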
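
As a worked example of the note above, the following sketch computes the token count and the resulting rescaling_factor for a given sequence length. The helper names num_tokens_with_cls and rescaling_factor are hypothetical, and the sketch assumes a sequence whose length is a multiple of 6 and that tokenizes entirely into 6-mers. Only the constants 6 and 2048 and the 30,000 bp and 40,008 bp examples come from the text.

```python
# Constants taken from the note above.
NUCLEOTIDES_PER_TOKEN = 6   # the tokenizer groups DNA into 6-mers
MAX_NUM_TOKENS_NT = 2048    # max tokens the backbone Nucleotide Transformer was trained on


def num_tokens_with_cls(sequence_length_bp: int) -> int:
    """Token count, including the CLS token (hypothetical helper).

    Assumes the sequence splits cleanly into 6-mers.
    """
    if sequence_length_bp % NUCLEOTIDES_PER_TOKEN != 0:
        raise ValueError("this sketch only handles lengths that are a multiple of 6")
    return sequence_length_bp // NUCLEOTIDES_PER_TOKEN + 1


def rescaling_factor(sequence_length_bp: int) -> float:
    """num_dna_tokens_inference / max_num_tokens_nt, as defined in the note."""
    return num_tokens_with_cls(sequence_length_bp) / MAX_NUM_TOKENS_NT


print(num_tokens_with_cls(30_000))  # 5001, the default training length
print(num_tokens_with_cls(40_008))  # 6669, the example from the note
print(rescaling_factor(40_008))
```

The resulting value would then be written to the rescaling_factor field of the model config before running inference. How the config is loaded and modified is not shown here.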

# Load the model and tokenizer
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)
model = AutoModel.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)

# Choose the length to which the input sequences are padded. By default, the
# model's max length is chosen, but it can be decreased, since the time needed
# to compute the embeddings increases significantly with length.
# The number of DNA tokens (excluding the prepended CLS token) must be divisible
# by 2 to the power of the number of downsampling blocks, i.e. 4.
max_length = 12 + 1

assert (max_length - 1) % 4 == 0, (
    "The number of DNA tokens (excluding the CLS token prepended) needs to be divisible by "
    "2 to the power of the number of downsampling blocks, i.e. 4.")

# Create dummy DNA sequences and tokenize them
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length = max_length)["input_ids"]

# Run inference; padding tokens are masked out
attention_mask = tokens != tokenizer.pad_token_id
outs = model(
    tokens,
    attention_mask=attention_mask,
    output_hidden_states=True
)

# Get the logits over the genomic features
logits = outs.logits.detach()
# Convert them to probabilities
probabilities = torch.nn.functional.softmax(logits, dim=-1)
print(f"Probabilities shape: {probabilities.shape}")

# Get the probabilities associated with introns
idx_intron = model.config.features.index("intron")
probabilities_intron = probabilities[:,:,idx_intron]
print(f"Intron probabilities shape: {probabilities_intron.shape}")
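
Per-nucleotide probabilities are often easier to inspect as intervals. The sketch below is one possible post-processing step, not part of the model or its repository: it takes a plain Python list of probabilities that an element is present at each position and merges consecutive positions above a threshold into half-open intervals. The function name probabilities_to_intervals and the 0.5 threshold are assumptions, and extracting such a list from the probabilities tensor is deliberately left out.

```python
from typing import List, Tuple


def probabilities_to_intervals(
    probs: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Merge consecutive positions with prob >= threshold into [start, end) intervals.

    Hypothetical helper; the 0.5 threshold is an arbitrary illustrative choice.
    """
    intervals = []
    start = None
    for position, p in enumerate(probs):
        if p >= threshold and start is None:
            start = position                      # an interval opens here
        elif p < threshold and start is not None:
            intervals.append((start, position))   # the interval closes before this position
            start = None
    if start is not None:
        intervals.append((start, len(probs)))     # interval runs to the end of the sequence
    return intervals


example = [0.1, 0.7, 0.9, 0.8, 0.2, 0.1, 0.6, 0.95]
print(probabilities_to_intervals(example))  # [(1, 4), (6, 8)]
```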

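✨ Key features

Uses the Nucleotide Transformer DNA foundation model to predict the locations of genomic elements at single-nucleotide resolution.
Fine-tuned on a dataset containing the genomes of human and 5 selected species, so it can be used for multi-species genome analysis.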

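📚 Detailed documentation

Training data

The segment-nt-multi-species model was fine-tuned on the human, mouse, chicken, fly, zebrafish and worm genomes. For each species, a subset of chromosomes was held out as a validation set for monitoring training and as a test set for final evaluation.

Training procedure

Preprocessing

DNA sequences are tokenized with the Nucleotide Transformer tokenizer, which splits sequences into 6-mer tokens as described in the Tokenization section of the associated repository. The tokenizer has a vocabulary size of 4105. Model inputs then take the following form: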

<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>

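To make the input format above concrete, here is a simplified sketch of non-overlapping 6-mer tokenization. It is an illustration only, not the Nucleotide Transformer tokenizer: the function name kmer_tokenize is hypothetical, and emitting leftover nucleotides as single-nucleotide tokens is an assumption of this sketch. Characters such as N and the mapping of tokens to vocabulary ids are not handled.

```python
from typing import List


def kmer_tokenize(sequence: str, k: int = 6) -> List[str]:
    """Split a DNA string into non-overlapping k-mers, prefixed with a CLS token.

    Simplified, hypothetical illustration of the input format.
    """
    tokens = ["<CLS>"]
    full_length = len(sequence) - len(sequence) % k
    # Non-overlapping k-mers, read left to right
    for i in range(0, full_length, k):
        tokens.append(f"<{sequence[i:i + k]}>")
    # Assumption of this sketch: leftover nucleotides become single tokens
    for nucleotide in sequence[full_length:]:
        tokens.append(f"<{nucleotide}>")
    return tokens


print(" ".join(kmer_tokenize("ACGTGTACGTGCACGGACGACTAGTCAGCA")))
# <CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
```

For real inputs, use the tokenizer loaded with AutoTokenizer, which also maps each token to its vocabulary id.
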
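Training

The model was fine-tuned for 3 days on a DGX H100 node with 8 GPUs, on a total of 8 billion tokens.

Architecture

The model consists of the nucleotide-transformer-v2-500m-multi-species encoder, from which we removed the language model head and replaced it with a 1-dimensional U-Net segmentation head [4] made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each block consists of 2 convolutional layers, with 1024 and 2048 kernels respectively. This additional segmentation head accounts for 53 million parameters, bringing the total number of parameters to 562 million.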
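
The 2 downsampling blocks are the reason the quick-start code requires the number of DNA tokens (excluding CLS) to be divisible by 2 to the power of 2, i.e. 4. The sketch below computes the smallest padded max_length that satisfies this constraint for a given number of DNA tokens. The helper name valid_max_length is hypothetical; only the block count and the divisibility rule come from the text.

```python
NUM_DOWNSAMPLING_BLOCKS = 2  # from the architecture description above


def valid_max_length(num_dna_tokens: int, num_blocks: int = NUM_DOWNSAMPLING_BLOCKS) -> int:
    """Smallest max_length (DNA tokens + 1 CLS token) whose DNA-token count
    is a multiple of 2 ** num_blocks. Hypothetical helper."""
    multiple = 2 ** num_blocks
    # Round the DNA-token count up to the next multiple (ceiling division)
    padded_dna_tokens = -(-num_dna_tokens // multiple) * multiple
    return padded_dna_tokens + 1  # add the CLS token


print(valid_max_length(8))     # 9
print(valid_max_length(9))     # 13
print(valid_max_length(5000))  # 5001, the default maximum length
```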

BibTeX entry and citation info

@article{de2024segmentnt,
  title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
  author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
  journal={bioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}