Segment Borzoi

Developed by InstaDeepAI

SegmentBorzoi is a segmentation model based on Borzoi, designed to predict the locations of various genomic elements at single-nucleotide resolution.

Release date: 12/24/2024

Model Overview

The model was trained on 14 categories of genomic elements: gene elements (protein-coding genes, lncRNAs, 5'UTRs, 3'UTRs, exons, introns, splice acceptor sites, and splice donor sites) and regulatory elements (polyA signals, tissue-invariant and tissue-specific promoters and enhancers, and CTCF binding sites).

Model Features

High-resolution prediction

Capable of predicting genomic element locations at single-nucleotide resolution.

Multi-category training

Trained on 14 different categories of genomic elements, including genes and regulatory elements.

Borzoi-based architecture

Keeps the Borzoi backbone and replaces its original heads with a 1D U-Net segmentation head.

Model Capabilities

Genomic element location prediction

DNA sequence analysis

High-resolution segmentation

Use Cases

Genomic research

Gene location prediction

Predicts the locations of genes such as protein-coding genes and lncRNA in DNA sequences.

Regulatory element analysis

Identifies the locations of regulatory elements such as polyA signals, promoters, and enhancers.

🚀 SegmentBorzoi

SegmentBorzoi is a segmentation model that uses Borzoi to predict the location of several types of genomic elements in a DNA sequence at single-nucleotide resolution. Per-nucleotide predictions of this kind are useful for genome annotation and downstream genomics analysis.

🚀 Quick Start

Until the next release of the transformers library, it must be installed from source in order to use the model. PyTorch, einops, and borzoi_pytorch must also be installed.

pip install --upgrade git+https://github.com/huggingface/transformers.git
pip install torch einops borzoi_pytorch==0.4.0
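After running the commands above, it can help to confirm which versions actually ended up in the environment, since borzoi_pytorch is pinned to 0.4.0. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `report` are hypothetical, and it assumes the distribution names match the names used in the pip commands.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the install commands above.
REQUIRED = ["transformers", "torch", "einops", "borzoi_pytorch"]


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report(packages=REQUIRED):
    """Map each required package to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


for name, found in report().items():
    print(f"{name}: {found if found is not None else 'not installed'}")
```

If borzoi_pytorch reports anything other than 0.4.0, reinstalling with the pinned command is the safest option.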

✨ Features

SegmentBorzoi predicts the location of several types of genomic elements in a sequence at single-nucleotide resolution.
It was trained on 14 classes covering gene and regulatory elements.
It is based on the published implementation of Borzoi, with adjustments for easier implementation and better performance.

💻 Usage Examples

Basic Usage

The following code snippet shows how to retrieve logits from dummy DNA sequences:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("InstaDeepAI/segment_borzoi", trust_remote_code=True)

def encode_sequences(sequences):
    one_hot_map = {
        'a': torch.tensor([1., 0., 0., 0.]),
        'c': torch.tensor([0., 1., 0., 0.]),
        'g': torch.tensor([0., 0., 1., 0.]),
        't': torch.tensor([0., 0., 0., 1.]),
        'n': torch.tensor([0., 0., 0., 0.]),
        'A': torch.tensor([1., 0., 0., 0.]),
        'C': torch.tensor([0., 1., 0., 0.]),
        'G': torch.tensor([0., 0., 1., 0.]),
        'T': torch.tensor([0., 0., 0., 1.]),
        'N': torch.tensor([0., 0., 0., 0.])
    }

    def encode_sequence(seq_str):
        one_hot_list = []
        for char in seq_str:
            one_hot_vector = one_hot_map.get(char, torch.tensor([0.25, 0.25, 0.25, 0.25]))
            one_hot_list.append(one_hot_vector)
        return torch.stack(one_hot_list)

    if isinstance(sequences, list):
        return torch.stack([encode_sequence(seq) for seq in sequences])
    else:
        return encode_sequence(sequences)

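# Two dummy sequences at the 524,288 bp input length.
sequences = ["A"*524_288, "G"*524_288]
one_hot_encoding = encode_sequences(sequences)  # shape: (2, 524288, 4)

# Inference only, so gradient tracking is switched off.
model.eval()
with torch.no_grad():
    preds = model(one_hot_encoding)
print(preds['logits'])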
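The snippet above stops at raw logits. A common next step for a segmentation model is to turn per-nucleotide probabilities for one element type into genomic intervals. The sketch below shows only that last step, in plain Python. How to obtain probabilities from `preds['logits']` depends on the layout of the output tensor, which this page does not document, so the input here is a hypothetical list of probabilities; the helper name `probs_to_intervals` and the 0.5 threshold are also assumptions.

```python
def probs_to_intervals(probs, threshold=0.5):
    """Merge consecutive positions with probability >= threshold
    into half-open (start, end) intervals."""
    intervals = []
    start = None
    for pos, p in enumerate(probs):
        if p >= threshold and start is None:
            start = pos                      # a new run begins here
        elif p < threshold and start is not None:
            intervals.append((start, pos))   # the run ended just before pos
            start = None
    if start is not None:                    # run extends to the last position
        intervals.append((start, len(probs)))
    return intervals


# Hypothetical per-nucleotide probabilities for a single element type.
exon_probs = [0.1, 0.2, 0.9, 0.95, 0.8, 0.3, 0.1, 0.7, 0.6]
print(probs_to_intervals(exon_probs))  # [(2, 5), (7, 9)]
```

Half-open intervals are used so that `end - start` gives the length of each predicted element directly.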

📚 Documentation

Training data

SegmentBorzoi was trained on all human chromosomes except chromosomes 20 and 21, which were kept as the test set, and chromosome 22, which was used as the validation set. During training, sequences were sampled at random from the genome together with their annotations. The validation and test sequences, by contrast, were fixed using a sliding window of length 524 kb (the original Borzoi input length) over chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.
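The split and the fixed evaluation windows described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' data pipeline: the window stride is not given in the source, so non-overlapping windows are assumed, the window length is taken as the 524,288 bp used in the usage example, and the chromosome length in the demo is a toy value. The names `split_of` and `sliding_windows` are hypothetical.

```python
WINDOW = 524_288  # input length used in the usage example (about 524 kb)

# Held-out chromosomes as described in the text.
HELD_OUT = {"test": ["chr20", "chr21"], "validation": ["chr22"]}


def split_of(chrom):
    """Return the split a chromosome belongs to."""
    for split, chroms in HELD_OUT.items():
        if chrom in chroms:
            return split
    return "train"


def sliding_windows(chrom_length, window=WINDOW, stride=None):
    """Yield half-open (start, end) windows that fit entirely inside a chromosome.

    stride defaults to the window length (non-overlapping), which is an assumption.
    """
    stride = window if stride is None else stride
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    for start in range(0, chrom_length - window + 1, stride):
        yield (start, start + window)


toy_length = 2_000_000  # placeholder, not a real chromosome length
windows = list(sliding_windows(toy_length))
print(len(windows), windows[-1])  # 3 (1048576, 1572864)
print(split_of("chr1"), split_of("chr20"), split_of("chr22"))
```

Because the windows are computed from fixed coordinates rather than sampled, the same evaluation sequences are seen at every validation step.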

Training procedure

Preprocessing

The DNA sequences are tokenized using one-hot encoding, similar to the Enformer model.
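To make the encoding concrete, here is a dependency-free restatement of the scheme used in the usage example: channels are ordered A, C, G, T, the letter N maps to all zeros, and any other character maps to a uniform 0.25 vector. The function name `one_hot` is hypothetical, and plain lists stand in for tensors.

```python
# Channel order A, C, G, T, matching the usage example.
ONE_HOT = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
    "N": [0.0, 0.0, 0.0, 0.0],
}
UNKNOWN = [0.25, 0.25, 0.25, 0.25]  # fallback for any other character


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows, one row per nucleotide."""
    # Upper-casing handles lower-case input; each row is copied so callers
    # can modify the result without touching the lookup table.
    return [list(ONE_HOT.get(base, UNKNOWN)) for base in seq.upper()]


for base, row in zip("acgNR", one_hot("acgNR")):
    print(base, row)
```

Each sequence of length L becomes an L x 4 matrix, and a batch of equal-length sequences stacks into a (batch, L, 4) array, which is the shape produced by `encode_sequences`.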

Architecture

The model consists of the Borzoi backbone. The original heads are removed and replaced by a 1-dimensional U-Net segmentation head, which is composed of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these blocks has 2 convolutional layers with 1,024 and 2,048 kernels respectively.
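The block layout of such a head can be followed with simple shape bookkeeping. The sketch below is not the model's implementation: it assumes each downsampling block halves the sequence axis and each upsampling block doubles it (the usual U-Net convention), that 1,024 and 2,048 are the channel widths of the first and second levels, and that the decoder mirrors the encoder. None of these details is confirmed by the source, and the input length and channel count in the demo are placeholders.

```python
def unet_head_shapes(length, in_channels, widths=(1024, 2048)):
    """Trace (stage, channels, length) through a 2-down / 2-up 1D U-Net.

    Assumes halving on the way down and doubling on the way up.
    """
    factor = 2 ** len(widths)
    if length % factor != 0:
        raise ValueError(f"length must be divisible by {factor}")

    trace = [("input", in_channels, length)]

    # Encoder: each block ends at half the length with the given width.
    for level, width in enumerate(widths, start=1):
        length //= 2
        trace.append((f"down{level}", width, length))

    # Decoder: mirror the encoder, ending at the input width and length.
    decoder_widths = list(reversed((in_channels,) + tuple(widths[:-1])))
    for level, width in enumerate(decoder_widths, start=1):
        length *= 2
        trace.append((f"up{level}", width, length))

    return trace


# Placeholder input size, chosen only to make the trace easy to read.
for stage, channels, positions in unet_head_shapes(length=4096, in_channels=512):
    print(f"{stage:>6}: channels={channels:<5} length={positions}")
```

The point of the trace is that the output length equals the input length, which is what lets a U-Net head emit one prediction per input position.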

BibTeX entry and citation info

@article{de2024segmentnt,
  title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
  author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
  journal={bioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

Property	Details
Developed by	InstaDeep
Pipeline tag	feature-extraction
Tags	model_hub_mixin, pytorch_model_hub_mixin
