SegmentNT Open-source DNA Segmentation Model - Free Prediction of Positions of Multiple Genomic Elements in Sequences

Developed by InstaDeepAI

SegmentNT is a DNA segmentation model based on the Nucleotide Transformer, capable of predicting the positions of multiple genomic elements in a sequence at single-nucleotide resolution.

Release Time : 3/4/2024

Model Overview

SegmentNT is a segmentation model built on a DNA foundation model, capable of predicting the positions of 14 different types of genomic elements, including genes and regulatory elements, in human genomic input sequences up to 30 kb in length.

Model Features

High-resolution segmentation

Capable of predicting the positions of genomic elements at single nucleotide resolution

Long sequence processing

Can process DNA sequences up to 30kb in length and can be extended to 50kb

Multi-element prediction

Can predict 14 different types of genomic elements, including genes and regulatory elements

Model Capabilities

DNA sequence segmentation

Genomic element prediction

Long sequence processing

Use Cases

Genomics research

Gene structure prediction

Predict gene structure elements such as protein-coding genes and non-coding RNAs

High-precision single nucleotide resolution prediction

Regulatory element identification

Identify regulatory elements such as promoters and enhancers

Can distinguish between tissue-specific and tissue-invariant regulatory elements

🚀 SegmentNT

SegmentNT is a segmentation model that uses the Nucleotide Transformer (NT) DNA foundation model to predict the location of several types of genomic elements in a sequence at single-nucleotide resolution. It was trained on 14 different classes of human genomic elements in input sequences up to 30 kb, including gene (protein-coding genes, lncRNAs, 5'UTR, 3'UTR, exon, intron, splice acceptor and donor sites) and regulatory (polyA signal, tissue-invariant and tissue-specific promoters and enhancers, and CTCF-bound sites) elements.

Developed by: InstaDeep

🚀 Quick Start

Model Sources

Repository: Nucleotide Transformer
Paper: Segmenting the genome at single-nucleotide resolution with DNA foundation models

How to use

Until its next release, the transformers library needs to be installed from source with the following command to use the models:

pip install --upgrade git+https://github.com/huggingface/transformers.git

A small code snippet is provided here to retrieve both logits and embeddings from a dummy DNA sequence.

⚠️ Important Note

The maximum sequence length is set by default to the training length of 30,000 nucleotides, or 5001 tokens (including the CLS token). However, SegmentNT-multi-species has been shown to generalize to sequences of up to 50,000 bp. If you need to run inference on sequences between 30 kbp and 50 kbp, make sure to set the rescaling_factor of the Rotary Embedding layer in the esm model to num_dna_tokens_inference / max_num_tokens_nt, where num_dna_tokens_inference is the number of tokens at inference (i.e. 6669 for a sequence of 40,008 base pairs) and max_num_tokens_nt is the maximum number of tokens on which the backbone nucleotide-transformer was trained, i.e. 2048.
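The arithmetic in the note above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and are not part of the model's API, and the sketch assumes the sequence length is a multiple of 6 so that it tokenizes cleanly into 6-mers. How the resulting value is passed to the model is shown in the notebook mentioned below, not here.

```python
def num_tokens_with_cls(sequence_length_bp: int, kmer: int = 6) -> int:
    """Token count for a sequence split into 6-mers, plus the CLS token.

    Hypothetical helper: only handles lengths that are a multiple of `kmer`.
    """
    if sequence_length_bp % kmer != 0:
        raise ValueError("this sketch only handles lengths that are a multiple of 6")
    return sequence_length_bp // kmer + 1


def rescaling_factor(sequence_length_bp: int, max_num_tokens_nt: int = 2048) -> float:
    """num_dna_tokens_inference / max_num_tokens_nt, as described in the note."""
    return num_tokens_with_cls(sequence_length_bp) / max_num_tokens_nt


print(num_tokens_with_cls(30_000))         # 5001, the default maximum length
print(num_tokens_with_cls(40_008))         # 6669, the example from the note
print(round(rescaling_factor(40_008), 4))  # 3.2563
```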

The ./inference_segment_nt.ipynb notebook can be run in Google Colab. It shows how to handle inference both on sequence lengths that require changing the rescaling factor and on those that do not. Running the notebook reproduces Fig. 1 and Fig. 3 of the SegmentNT paper.

💻 Usage Examples

Basic Usage

# Load model and tokenizer
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)
model = AutoModel.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)

# Choose the length to which the input sequences are padded. By default, the
# model max length is chosen, but feel free to decrease it as the time taken to
# obtain the embeddings increases significantly with it.
# The number of DNA tokens (excluding the prepended CLS token) needs to be divisible by
# 2 to the power of the number of downsampling blocks, i.e. 4.
max_length = 12 + 1

assert (max_length - 1) % 4 == 0, (
    "The number of DNA tokens (excluding the prepended CLS token) needs to be divisible by "
    "2 to the power of the number of downsampling blocks, i.e. 4.")

# Create dummy DNA sequences and tokenize them
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

# Infer
attention_mask = tokens != tokenizer.pad_token_id
outs = model(
    tokens,
    attention_mask=attention_mask,
    output_hidden_states=True
)

# Obtain the logits over the genomic features
logits = outs.logits.detach()
# Transform them into probabilities
probabilities = torch.nn.functional.softmax(logits, dim=-1)
print(f"Probabilities shape: {probabilities.shape}")

# Get probabilities associated with intron
idx_intron = model.config.features.index("intron")
probabilities_intron = probabilities[:,:,idx_intron]
print(f"Intron probabilities shape: {probabilities_intron.shape}")
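Once per-nucleotide probabilities are available for one feature, a common next step is to turn them into intervals. The following is a minimal sketch of one way to do that, not something the model card prescribes: the function name is hypothetical, the 0.5 threshold is an assumption, and the input is assumed to be a flat list of probabilities that the feature is present at each position of a single sequence.

```python
from typing import List, Tuple


def probabilities_to_intervals(
    probs: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Group consecutive positions with probability >= threshold.

    Returns half-open (start, end) intervals in sequence coordinates.
    The threshold value is an illustrative assumption.
    """
    intervals = []
    start = None
    for position, p in enumerate(probs):
        if p >= threshold and start is None:
            start = position                      # a new segment opens here
        elif p < threshold and start is not None:
            intervals.append((start, position))   # the segment closes here
            start = None
    if start is not None:                         # segment runs to the end
        intervals.append((start, len(probs)))
    return intervals


example = [0.1, 0.7, 0.9, 0.8, 0.2, 0.1, 0.6, 0.95]
print(probabilities_to_intervals(example))  # [(1, 4), (6, 8)]
```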

📚 Documentation

Training data

The SegmentNT model was trained on all human chromosomes except chromosomes 20 and 21, which were kept as the test set, and chromosome 22, which was used as the validation set. During training, sequences are randomly sampled from the genome together with their annotations. The sequences in the validation and test sets, however, are kept fixed by using a sliding window of length 30,000 over the chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.
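The split and the fixed evaluation windows described above could be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the chromosome naming ("chr20" and so on), the helper names, and the non-overlapping stride equal to the window length are all assumptions, since the text only gives the window length.

```python
from typing import Iterator, Tuple

# Held-out chromosomes as stated in the text; the "chr" prefix is an assumption.
TEST_CHROMOSOMES = ("chr20", "chr21")
VALIDATION_CHROMOSOMES = ("chr22",)


def split_of(chromosome: str) -> str:
    """Return which split a chromosome belongs to."""
    if chromosome in TEST_CHROMOSOMES:
        return "test"
    if chromosome in VALIDATION_CHROMOSOMES:
        return "validation"
    return "train"


def sliding_windows(
    chromosome_length: int, window: int = 30_000, stride: int = 30_000
) -> Iterator[Tuple[int, int]]:
    """Yield fixed half-open (start, end) windows; stride is an assumed value."""
    for start in range(0, chromosome_length - window + 1, stride):
        yield start, start + window


print(split_of("chr1"), split_of("chr20"), split_of("chr22"))
print(list(sliding_windows(100_000)))
# [(0, 30000), (30000, 60000), (60000, 90000)]
```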

Training procedure

Preprocessing

The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer, which tokenizes sequences as 6-mer tokens as described in the Tokenization section of the associated repository. This tokenizer has a vocabulary size of 4105. The inputs of the model are then of the form:

<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
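To make the token layout concrete, here is a minimal sketch of non-overlapping 6-mer tokenization with a prepended CLS token. It is not the actual Nucleotide Transformer Tokenizer: the function name is hypothetical, the handling of leftover bases as single-nucleotide tokens is an assumption, and special cases such as ambiguous bases are ignored.

```python
from typing import List


def kmer_tokenize(sequence: str, k: int = 6) -> List[str]:
    """Split a DNA string into non-overlapping k-mers, prefixed with <CLS>.

    Assumption: bases left over after the last full k-mer are emitted as
    single-nucleotide tokens.
    """
    tokens = ["<CLS>"]
    full = len(sequence) - len(sequence) % k   # length covered by full k-mers
    tokens.extend(sequence[i:i + k] for i in range(0, full, k))
    tokens.extend(sequence[full:])             # leftover bases, one per token
    return tokens


print(kmer_tokenize("ACGTGTACGTGCACGGAC"))
# ['<CLS>', 'ACGTGT', 'ACGTGC', 'ACGGAC']
```

Under this sketch, a 30,000-nucleotide input yields 5,000 6-mer tokens plus the CLS token.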

Training

The model was trained on a DGX H100 node with 8 GPUs on a total of 23B tokens for 3 days. The model was trained successively on 3 kb, 10 kb, 20 kb and finally 30 kb sequences, each time with an effective batch size of 256 sequences.

Architecture

The model is composed of the nucleotide-transformer-v2-500m-multi-species encoder, from which we removed the language model head and replaced it with a 1-dimensional U-Net segmentation head [4] made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels respectively. This additional segmentation head accounts for 53 million parameters, bringing the total number of parameters to 562M.
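The two downsampling blocks explain why the number of DNA tokens must be divisible by 4 in the usage example. The sketch below traces the sequence length through the head, assuming each downsampling block halves the length and each upsampling block doubles it; that assumption and the function name are illustrative and not taken from the model code.

```python
from typing import List


def unet_lengths(num_dna_tokens: int, num_downsampling_blocks: int = 2) -> List[int]:
    """Sequence length after each block of a symmetric 1D U-Net.

    Assumes each downsampling block halves the length and each upsampling
    block doubles it, so the input must be divisible by 2 ** num_blocks.
    """
    factor = 2 ** num_downsampling_blocks
    if num_dna_tokens % factor != 0:
        raise ValueError(f"number of DNA tokens must be divisible by {factor}")
    down = [num_dna_tokens // 2 ** i for i in range(num_downsampling_blocks + 1)]
    up = down[-2::-1]  # mirror the path back up to the input length
    return down + up


print(unet_lengths(12))    # [12, 6, 3, 6, 12]
print(unet_lengths(5000))  # [5000, 2500, 1250, 2500, 5000]
```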

BibTeX entry and citation info

@article{de2024segmentnt,
  title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
  author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
  journal={bioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

📄 License

The license for this project is cc-by-nc-sa-4.0.

Property	Details
Model Type	Segmentation model leveraging the Nucleotide Transformer DNA foundation model
Training Data	All human chromosomes except for chromosomes 20 and 21 (test set), and chromosome 22 (validation set)
