🚀 SegmentNT-multi-species
SegmentNT-multi-species is a segmentation model that uses the Nucleotide Transformer DNA foundation model to predict the location of various genomic elements in a sequence at single-nucleotide resolution. It is the result of finetuning the SegmentNT model on datasets from multiple species.
🚀 Quick Start
Until its next release, the transformers library needs to be installed from source with the following command in order to use the models:
pip install --upgrade git+https://github.com/huggingface/transformers.git
A small snippet of code is given below to retrieve both logits and embeddings from a dummy DNA sequence.
⚠️ Important Note
The maximum sequence length is set by default to the training length of 30,000 nucleotides, or 5,001 tokens (accounting for the CLS token). However, SegmentNT has been shown to generalize to sequences of up to 50,000 bp. If you need to run inference on sequences between 30kbp and 50kbp, make sure to change the rescaling_factor argument in the config to num_dna_tokens_inference / max_num_tokens_nt, where num_dna_tokens_inference is the number of tokens at inference (e.g. 6669 for a sequence of 40,008 base pairs) and max_num_tokens_nt is the maximum number of tokens on which the backbone nucleotide transformer was trained, i.e. 2048.
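For instance, here is a minimal sketch of how to apply this change for a 40,008 bp input before loading the model (this assumes rescaling_factor is exposed as a config attribute, as described above; the full Quick Start snippet follows below):

from transformers import AutoConfig, AutoModel

# Sketch only: rescale for inference on a 40,008 bp sequence (6669 tokens, per the note above),
# assuming the checkpoint's config exposes a rescaling_factor attribute.
config = AutoConfig.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)
config.rescaling_factor = 6669 / 2048
model = AutoModel.from_pretrained(
    "InstaDeepAI/segment_nt_multi_species", config=config, trust_remote_code=True
)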
from transformers import AutoTokenizer, AutoModel
import torch
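# Load the tokenizer and the model; trust_remote_code is required because the modelling code ships with the checkpoint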
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)
model = AutoModel.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)
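# Choose the padded length: 12 DNA tokens plus 1 CLS token; the number of DNA tokens
# must be divisible by 2**2 = 4 (one factor of 2 per downsampling block)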
max_length = 12 + 1
assert (max_length - 1) % 4 == 0, (
    "The number of DNA tokens (excluding the prepended CLS token) needs to be divisible by "
    "2 to the power of the number of downsampling blocks, i.e. 4."
)
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
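# Tokenize the sequences, padding them to max_length, and build the attention mask from the pad token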
tokens = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]
attention_mask = tokens != tokenizer.pad_token_id
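# Run a forward pass to obtain the logits (and hidden states, i.e. the embeddings)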
outs = model(
    tokens,
    attention_mask=attention_mask,
    output_hidden_states=True
)
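# Detach the logits and turn them into per-class probabilities with a softmax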
logits = outs.logits.detach()
probabilities = torch.nn.functional.softmax(logits, dim=-1)
print(f"Probabilities shape: {probabilities.shape}")
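# Get the probabilities associated with the intron element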
idx_intron = model.config.features.index("intron")
probabilities_intron = probabilities[:,:,idx_intron]
print(f"Intron probabilities shape: {probabilities_intron.shape}")
✨ Features
SegmentNT-multi-species is a segmentation model leveraging the Nucleotide Transformer (NT) DNA foundation model to predict the location of several types of genomic elements in a sequence at single-nucleotide resolution. It is the result of finetuning the SegmentNT model on a dataset encompassing not only the human genome but also the genomes of 5 selected species: mouse, chicken, fly, zebrafish and worm.
For finetuning on the multi-species genomes, a dataset was curated from a subset of the annotations used to train SegmentNT, mainly because only this subset of annotations is available for these species. The annotations therefore cover the 7 main gene elements available from Ensembl, namely protein-coding gene, 5'UTR, 3'UTR, intron, exon, and splice acceptor and donor sites.
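The list of predicted elements is exposed on the checkpoint configuration (the same features attribute used in the Quick Start snippet to locate the intron class), so it can be inspected directly; for example:

from transformers import AutoConfig

# Print the genomic elements predicted by this checkpoint
config = AutoConfig.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)
print(config.features)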
📦 Installation
pip install --upgrade git+https://github.com/huggingface/transformers.git
📚 Documentation
Model Sources
Training data
The SegmentNT-multi-species model was finetuned on the human, mouse, chicken, fly, zebrafish and worm genomes. For each species, a subset of chromosomes is held out as a validation set for monitoring training and another as a test set for the final evaluation.
Training procedure
Preprocessing
The DNA sequences are tokenized using the Nucleotide Transformer tokenizer, which tokenizes sequences as 6-mer tokens as described in the Tokenization section of the associated repository. This tokenizer has a vocabulary size of 4,105. The inputs of the model then take the form:
<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
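As a rough illustration (a sketch reusing the same tokenizer as in the Quick Start; the exact spelling of the special tokens depends on the vocabulary), the token ids can be mapped back to these 6-mers:

from transformers import AutoTokenizer

# Sketch: tokenize a 30 bp sequence and map the ids back to 6-mer tokens
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)
ids = tokenizer.batch_encode_plus(["ACGTGTACGTGCACGGACGACTAGTCAGCA"], return_tensors="pt")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids[0].tolist()))
# expected: a CLS token followed by the 6-mers ACGTGT, ACGTGC, ACGGAC, GACTAG, TCAGCA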
Training
The model was finetuned on a DGXH100 node with 8 GPUs on a total of 8B tokens for 3 days.
Architecture
The model is composed of the nucleotide-transformer-v2-500m-multi-species encoder, from which the language model head was removed and replaced by a 1-dimensional U-Net segmentation head [4] made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels respectively. This additional segmentation head accounts for 53 million parameters, bringing the total number of parameters to 562M.
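For illustration only, here is a minimal PyTorch sketch of such a head (not the actual SegmentNT implementation): the channel sizes follow the description above, while the kernel sizes, pooling and skip-connection wiring are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    # Two 1D convolutions with 1,024 and 2,048 kernels, as described above
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 1024, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(1024, 2048, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class UNetSegmentationHead(nn.Module):
    # Sketch of a 1D U-Net head with 2 downsampling and 2 upsampling blocks
    def __init__(self, embed_dim, num_features):
        super().__init__()
        self.down1 = ConvBlock(embed_dim)
        self.down2 = ConvBlock(2048)
        self.pool = nn.MaxPool1d(kernel_size=2)
        self.up1 = ConvBlock(2048 + 2048)  # skip connection from down2
        self.up2 = ConvBlock(2048 + 2048)  # skip connection from down1
        # 2 logits (absent / present) per genomic element and per position
        self.out = nn.Conv1d(2048, num_features * 2, kernel_size=1)

    def forward(self, x):
        # x: (batch, embed_dim, seq_len), with seq_len divisible by 4
        d1 = self.down1(x)              # (batch, 2048, seq_len)
        d2 = self.down2(self.pool(d1))  # (batch, 2048, seq_len / 2)
        bottom = self.pool(d2)          # (batch, 2048, seq_len / 4)
        u1 = self.up1(torch.cat([F.interpolate(bottom, scale_factor=2.0), d2], dim=1))
        u2 = self.up2(torch.cat([F.interpolate(u1, scale_factor=2.0), d1], dim=1))
        return self.out(u2)             # (batch, num_features * 2, seq_len)

Here embed_dim would be the hidden size of the nucleotide-transformer encoder and num_features the number of annotated gene elements (7 for the multi-species model).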
📄 License
This project is licensed under the CC BY-NC-SA 4.0 license.