SegmentBorzoi open-source genome segmentation model - free single-nucleotide prediction of multiple genomic elements

Segment Borzoi

Developed by InstaDeepAI

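SegmentBorzoi is a segmentation model built on Borzoi that predicts the positions of several types of genomic elements in a sequence at single-nucleotide resolution.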

Protein model #Single-nucleotide-resolution genome prediction #DNA sequence segmentation #Multi-class genomic element annotation

Downloads: 37

Released: 12/24/2024

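Model Overview

The model was trained on 14 classes of genomic elements, covering gene elements (protein-coding genes, lncRNAs, 5'UTR, 3'UTR, exons, introns, splice acceptor and donor sites) and regulatory elements (polyA signals, tissue-invariant and tissue-specific promoters and enhancers, and CTCF-bound sites).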

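Model Features

High-resolution prediction

Predicts the positions of genomic elements at single-nucleotide resolution.

Multi-class training

Trained on 14 classes of genomic elements, including both gene elements and regulatory elements.

Built on the Borzoi architecture

Uses the Borzoi backbone, with its original head replaced by a 1D U-Net segmentation head to improve segmentation performance.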

Model Capabilities

Genomic element position prediction

DNA sequence analysis

High-resolution segmentation

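Use Cases

Genomics research

Gene position prediction

Predict where protein-coding genes, lncRNAs, and other gene elements lie in a DNA sequence.

Regulatory element analysis

Locate regulatory elements such as polyA signals, promoters, and enhancers.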

🚀 SegmentBorzoi

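SegmentBorzoi is a segmentation model that leverages Borzoi to predict the positions of several types of genomic elements in a sequence at single-nucleotide resolution. It was trained on 14 classes, covering gene elements (protein-coding genes, lncRNAs, 5'UTR, 3'UTR, exons, introns, splice acceptor and donor sites) and regulatory elements (polyA signals, tissue-invariant and tissue-specific promoters and enhancers, and CTCF-bound sites).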

🚀 Quick Start

Until the next release of the transformers library, it must be installed from source in order to use the model. PyTorch, einops, and borzoi_pytorch must be installed as well.

pip install --upgrade git+https://github.com/huggingface/transformers.git
pip install torch einops borzoi_pytorch==0.4.0
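
Before loading the model, it can help to confirm that the four packages named above are actually visible to the interpreter. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are assumed to match the names used in the pip commands.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the install commands above (assumed to match
# the names under which the packages are registered).
REQUIRED = ("transformers", "torch", "einops", "borzoi_pytorch")


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

A `None` entry means the corresponding pip command still needs to be run in the active environment.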

The snippet below shows how to retrieve logits from dummy DNA sequences.

💻 Usage Example

Basic usage

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("InstaDeepAI/segment_borzoi", trust_remote_code=True)

def encode_sequences(sequences):
    one_hot_map = {
        'a': torch.tensor([1., 0., 0., 0.]),
        'c': torch.tensor([0., 1., 0., 0.]),
        'g': torch.tensor([0., 0., 1., 0.]),
        't': torch.tensor([0., 0., 0., 1.]),
        'n': torch.tensor([0., 0., 0., 0.]),
        'A': torch.tensor([1., 0., 0., 0.]),
        'C': torch.tensor([0., 1., 0., 0.]),
        'G': torch.tensor([0., 0., 1., 0.]),
        'T': torch.tensor([0., 0., 0., 1.]),
        'N': torch.tensor([0., 0., 0., 0.])
    }

    def encode_sequence(seq_str):
        one_hot_list = []
        for char in seq_str:
            one_hot_vector = one_hot_map.get(char, torch.tensor([0.25, 0.25, 0.25, 0.25]))
            one_hot_list.append(one_hot_vector)
        return torch.stack(one_hot_list)

    if isinstance(sequences, list):
        return torch.stack([encode_sequence(seq) for seq in sequences])
    else:
        return encode_sequence(sequences)

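# Two dummy sequences of 524,288 bp each (the Borzoi input length)
sequences = ["A"*524_288, "G"*524_288]
one_hot_encoding = encode_sequences(sequences)  # shape: (2, 524288, 4)

# Inference only, so skip gradient tracking to save memory
with torch.no_grad():
    preds = model(one_hot_encoding)
print(preds['logits'])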
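
The logits usually need to be converted into per-nucleotide probabilities and then into intervals before they are useful. The sketch below works on a plain NumPy array and rests on several assumptions that are not stated in this card: that the logits have shape (batch, length, 14, 2), that the last axis holds [absent, present] scores, and that the classes come in the order of the hypothetical `FEATURES` list. Check the real layout and class order against the loaded model before relying on it.

```python
import numpy as np

# Hypothetical class names and order; the actual order must be taken from the model.
FEATURES = [
    "protein_coding_gene", "lncRNA", "exon", "intron",
    "splice_donor", "splice_acceptor", "5UTR", "3UTR",
    "CTCF-bound", "polyA_signal",
    "enhancer_tissue_specific", "enhancer_tissue_invariant",
    "promoter_tissue_specific", "promoter_tissue_invariant",
]


def logits_to_probabilities(logits):
    """Softmax over the last axis, returning the 'present' probability.

    Assumes logits of shape (batch, length, n_classes, 2).
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs[..., 1]


def call_segments(track, threshold=0.5):
    """Return half-open (start, end) intervals where a 1D track exceeds threshold."""
    mask = track > threshold
    padded = np.concatenate(([False], mask, [False]))
    # Positions where the mask flips mark alternating segment starts and ends.
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(edges[::2].tolist(), edges[1::2].tolist()))
```

With the model output from the example above, the input would be something like `preds['logits'].numpy()`, and `call_segments(probs[0, :, FEATURES.index("intron")])` would list the predicted intron intervals of the first sequence under the stated assumptions.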

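📦 Training Data

SegmentBorzoi was trained on all human chromosomes except chromosomes 20 and 21, which were held out as the test set, and chromosome 22, which was used as the validation set. During training, sequences were sampled at random from the genome together with their annotations. The validation and test sequences, however, were kept fixed by sliding a window of length 524kb (the original Borzoi input length) over chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.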
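
The chromosome split and the fixed evaluation windows can be sketched as follows. This is an illustration only: the `chr20`-style names, the function names, and the non-overlapping default stride are assumptions, since the card gives the window length but not the stride.

```python
SEQ_LEN = 524_288  # Borzoi input length in base pairs (the "524kb" above)

TEST_CHROMS = {"chr20", "chr21"}
VAL_CHROMS = {"chr22"}


def assign_split(chrom):
    """Return the split a chromosome belongs to under the scheme described above."""
    if chrom in TEST_CHROMS:
        return "test"
    if chrom in VAL_CHROMS:
        return "validation"
    return "train"


def sliding_windows(chrom_length, window=SEQ_LEN, stride=None):
    """Yield half-open (start, end) windows that fit entirely inside the chromosome.

    The stride defaults to the window length (no overlap), which is an assumption.
    """
    stride = window if stride is None else stride
    for start in range(0, chrom_length - window + 1, stride):
        yield start, start + window
```

Because the windows depend only on the chromosome length, every evaluation run sees exactly the same sequences, which is what makes validation and test metrics comparable across checkpoints.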

🔧 Training Procedure

Preprocessing

DNA sequences are tokenized with one-hot encoding, similar to the Enformer model.
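
For sequences of several hundred kilobases, a table lookup is considerably faster than encoding character by character. The sketch below uses NumPy and follows the conventions of the usage example (A, C, G, T as one-hot rows, N as all zeros, any other character as a uniform 0.25 vector); the function name `one_hot_encode` is hypothetical, and the input is assumed to be ASCII.

```python
import numpy as np

# 256-row lookup table indexed by ASCII code. Unknown characters fall back to
# the uniform vector, as in the usage example.
_LOOKUP = np.full((256, 4), 0.25, dtype=np.float32)
for index, base in enumerate("ACGT"):
    row = np.zeros(4, dtype=np.float32)
    row[index] = 1.0
    _LOOKUP[ord(base)] = row
    _LOOKUP[ord(base.lower())] = row
_LOOKUP[ord("N")] = 0.0
_LOOKUP[ord("n")] = 0.0


def one_hot_encode(seq):
    """Encode an ASCII DNA string as a float32 array of shape (len(seq), 4)."""
    codes = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    return _LOOKUP[codes]
```

The result can be stacked into a batch with `np.stack` and converted with `torch.from_numpy` before being passed to the model.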

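Architecture

The model consists of the Borzoi backbone, whose original heads were removed and replaced with a 1D U-Net segmentation head made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each block contains 2 convolutional layers, with 1,024 and 2,048 kernels respectively.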
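
To make the shape of such a head concrete, here is a minimal PyTorch sketch of a two-level 1D U-Net. It is not the released implementation: the kernel size, activation, pooling and upsampling operators, skip connections, the way the 1,024 and 2,048 widths are assigned to blocks, the `in_channels` argument, and the one-logit-per-class output layout are all assumptions made for illustration.

```python
import torch
from torch import nn


class ConvBlock(nn.Module):
    """Two Conv1d layers; kernel size and activation are assumptions."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # keeps the sequence length unchanged
        self.layers = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding),
            nn.SiLU(),
            nn.Conv1d(out_channels, out_channels, kernel_size, padding=padding),
            nn.SiLU(),
        )

    def forward(self, x):
        return self.layers(x)


class UNet1DHead(nn.Module):
    """Illustrative head: 2 downsampling blocks, 2 upsampling blocks, skip connections."""

    def __init__(self, in_channels, num_classes=14, widths=(1024, 2048)):
        super().__init__()
        w1, w2 = widths
        self.down1 = ConvBlock(in_channels, w1)
        self.down2 = ConvBlock(w1, w2)
        self.up2 = ConvBlock(2 * w2, w1)
        self.up1 = ConvBlock(2 * w1, w1)
        self.pool = nn.AvgPool1d(kernel_size=2)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.classifier = nn.Conv1d(w1, num_classes, kernel_size=1)

    def forward(self, x):
        # x: (batch, in_channels, length), with length divisible by 4
        d1 = self.down1(x)                # (batch, w1, length)
        d2 = self.down2(self.pool(d1))    # (batch, w2, length / 2)
        bottom = self.pool(d2)            # (batch, w2, length / 4)
        u2 = self.up2(torch.cat([self.upsample(bottom), d2], dim=1))  # (batch, w1, length / 2)
        u1 = self.up1(torch.cat([self.upsample(u2), d1], dim=1))      # (batch, w1, length)
        return self.classifier(u1).transpose(1, 2)  # (batch, length, num_classes)
```

The skip connections let the upsampling path reuse full-resolution features from the downsampling path, which is the usual reason a U-Net is chosen for per-position labeling.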

📚 Citation

@article{de2024segmentnt,
  title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
  author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
  journal={bioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

Developer: InstaDeep