🚀 SegmentBorzoi
SegmentBorzoi is a segmentation model that uses Borzoi to predict the location of various types of genomic elements in a DNA sequence at single-nucleotide resolution. Its per-nucleotide predictions are useful for genomics research and analysis.
🚀 Quick Start
Until the next release of the transformers library, it must be installed from source to use the model. PyTorch, einops, and borzoi_pytorch are also required.
pip install --upgrade git+https://github.com/huggingface/transformers.git
pip install torch einops borzoi_pytorch==0.4.0
✨ Features
- SegmentBorzoi predicts the location of several types of genomic elements in a sequence at single-nucleotide resolution.
- It was trained on 14 different classes of genomic elements, including gene and regulatory elements.
- It is based on the published implementation of Borzoi, with adjustments for easier implementation and better performance.
📦 Installation
To install the necessary libraries for using SegmentBorzoi, run the following commands:
pip install --upgrade git+https://github.com/huggingface/transformers.git
pip install torch einops borzoi_pytorch==0.4.0
💻 Usage Examples
Basic Usage
The following code snippet shows how to retrieve logits from dummy DNA sequences:
import torch
from transformers import AutoModel
model = AutoModel.from_pretrained("InstaDeepAI/segment_borzoi", trust_remote_code=True)
def encode_sequences(sequences):
    # Channel order is A, C, G, T; 'N' maps to an all-zero vector.
    one_hot_map = {
        'a': torch.tensor([1., 0., 0., 0.]),
        'c': torch.tensor([0., 1., 0., 0.]),
        'g': torch.tensor([0., 0., 1., 0.]),
        't': torch.tensor([0., 0., 0., 1.]),
        'n': torch.tensor([0., 0., 0., 0.]),
        'A': torch.tensor([1., 0., 0., 0.]),
        'C': torch.tensor([0., 1., 0., 0.]),
        'G': torch.tensor([0., 0., 1., 0.]),
        'T': torch.tensor([0., 0., 0., 1.]),
        'N': torch.tensor([0., 0., 0., 0.])
    }

    def encode_sequence(seq_str):
        one_hot_list = []
        for char in seq_str:
            # Any other character falls back to a uniform vector.
            one_hot_vector = one_hot_map.get(char, torch.tensor([0.25, 0.25, 0.25, 0.25]))
            one_hot_list.append(one_hot_vector)
        return torch.stack(one_hot_list)

    # A list of equal-length sequences becomes a batch; a single string is encoded on its own.
    if isinstance(sequences, list):
        return torch.stack([encode_sequence(seq) for seq in sequences])
    else:
        return encode_sequence(sequences)

sequences = ["A"*524_288, "G"*524_288]
one_hot_encoding = encode_sequences(sequences)
preds = model(one_hot_encoding)
print(preds['logits'])
📚 Documentation
Training data
The SegmentBorzoi model was trained on all human chromosomes except for chromosomes 20 and 21 (kept as test set) and chromosome 22 (used as a validation set). During training, sequences are randomly sampled in the genome with associated annotations. The sequences in the validation and test sets are fixed using a sliding window of length 524 kb (the original Borzoi input length) over chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.
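The fixed evaluation windows described above can be sketched as follows. This is only an illustration: the helper name `sliding_windows`, the non-overlapping stride, and the choice to drop a trailing partial window are assumptions, since the source gives the window length but not the stride or the edge handling.

```python
def sliding_windows(chrom_length, window=524_288, stride=None):
    """Return (start, end) coordinates of fixed-length windows over a chromosome.

    Assumptions (not stated in the source): the stride defaults to the window
    length (no overlap) and a trailing partial window is dropped.
    """
    if stride is None:
        stride = window
    last_start = chrom_length - window
    return [(start, start + window) for start in range(0, last_start + 1, stride)]


# Hypothetical chromosome length, used only to show the output format.
windows = sliding_windows(2_000_000)
print(len(windows), windows[0], windows[-1])
```

Because the windows depend only on the chromosome length, the same call always yields the same evaluation set, which is what makes the validation and test sequences fixed.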
Training procedure
Preprocessing
The DNA sequences are tokenized using one-hot encoding, similar to the Enformer model.
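For long inputs, the same encoding can be computed with a lookup table instead of a per-character loop. The sketch below is not the authors' preprocessing code: the function name `one_hot_encode` is hypothetical, and the conventions (channel order A, C, G, T; `N` as all zeros; any other character as a uniform 0.25 vector) are taken from the usage example in this document.

```python
import torch

# Lookup table indexed by ASCII code. Unknown characters default to 0.25 per channel.
_TABLE = torch.full((256, 4), 0.25)
for _i, _base in enumerate("ACGT"):
    _row = torch.zeros(4)
    _row[_i] = 1.0
    _TABLE[ord(_base)] = _row
    _TABLE[ord(_base.lower())] = _row
_TABLE[ord("N")] = 0.0
_TABLE[ord("n")] = 0.0


def one_hot_encode(seq: str) -> torch.Tensor:
    """Encode a DNA string as a (len(seq), 4) float tensor."""
    # Non-ASCII characters are replaced by '?', which hits the 0.25 default row.
    codes = torch.tensor(list(seq.encode("ascii", errors="replace")), dtype=torch.long)
    return _TABLE[codes]


encoded = one_hot_encode("ACGTN")
print(encoded.shape)
```

Indexing the table with the whole sequence at once avoids creating one small tensor per nucleotide, which matters for 524,288-base inputs.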
Architecture
The model consists of the Borzoi backbone. The original heads are removed and replaced by a 1-dimensional U-Net segmentation head, which is composed of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these blocks has 2 convolutional layers with 1,024 and 2,048 kernels respectively.
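To make the shape of such a head concrete, here is a minimal PyTorch sketch of a 1-dimensional U-Net with 2 downsampling and 2 upsampling blocks of 2 convolutions each. It is not the released implementation. The class names, kernel size, SiLU activation, average pooling, nearest-neighbour upsampling, additive skip connections, the output layout, and the reading of "1,024 and 2,048 kernels" as the channel widths of the two resolution levels are all assumptions made for illustration.

```python
import torch
from torch import nn


class ConvBlock(nn.Module):
    """Two 1D convolutions, each followed by an activation (SiLU is assumed)."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.SiLU(),
        )

    def forward(self, x):
        return self.layers(x)


class UNet1DHead(nn.Module):
    """Hypothetical segmentation head: 2 down blocks, 2 up blocks, 1x1 classifier."""

    def __init__(self, embed_dim: int, channels=(1024, 2048), num_classes: int = 14):
        super().__init__()
        c1, c2 = channels
        self.down1 = ConvBlock(embed_dim, c1)
        self.down2 = ConvBlock(c1, c2)
        self.up1 = ConvBlock(c2, c1)
        self.up2 = ConvBlock(c1, c1)
        self.pool = nn.AvgPool1d(kernel_size=2)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.classifier = nn.Conv1d(c1, num_classes, kernel_size=1)

    def forward(self, x):
        # x: (batch, embed_dim, length), with length divisible by 4.
        skip1 = self.down1(x)                        # (batch, c1, length)
        skip2 = self.down2(self.pool(skip1))         # (batch, c2, length / 2)
        bottom = self.pool(skip2)                    # (batch, c2, length / 4)
        x = self.up1(self.upsample(bottom) + skip2)  # (batch, c1, length / 2)
        x = self.up2(self.upsample(x) + skip1)       # (batch, c1, length)
        return self.classifier(x)                    # (batch, num_classes, length)


# Small channel widths keep this demo light; the defaults mirror the text above.
head = UNet1DHead(embed_dim=8, channels=(16, 32), num_classes=14)
logits = head(torch.randn(2, 8, 64))
print(logits.shape)
```

The skip connections carry full-resolution features past the downsampled bottleneck, which is what lets a U-Net produce one prediction per input position.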
BibTeX entry and citation info
@article{de2024segmentnt,
  title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models},
  author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others},
  journal={bioRxiv},
  pages={2024--03},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
| Property | Details |
|----------|---------|
| Developed by | InstaDeep |
| Pipeline tag | feature-extraction |
| Tags | model_hub_mixin, pytorch_model_hub_mixin |