Nucleotide-transformer-2.5b-1000g Open-source DNA Model - Accurately Predict Molecular Phenotypes and Assist Gene Research

Nucleotide Transformer 2.5b 1000g

Developed by InstaDeepAI

A 2.5 billion-parameter DNA sequence foundation model pre-trained on 3,202 genetically diverse human genomes, capable of precise molecular phenotype prediction

Molecular Model

Transformers

#Whole Genome Pre-training #Multi-population DNA Modeling #2.5 Billion Parameter Scale


Release Time : 4/4/2023

Model Overview

The Nucleotide Transformer is a family of pre-trained language models for whole-genome DNA sequences. The family draws on human and multi-species genomic data; this checkpoint was pre-trained on human genomes only and targets molecular phenotype prediction.

Model Features

Multi-source Genome Pre-training

Across the collection, the models draw on 3,200+ human genomes and 850+ species; this checkpoint uses the 3,202 human genomes

Efficient Tokenization Strategy

Tokenizes sequences as 6-mers where possible and falls back to single nucleotides otherwise, balancing sequence information density with computational efficiency

Large-scale Parameters

2.5 billion parameter scale enables capturing complex genomic feature patterns

Model Capabilities

DNA Sequence Embedding Generation

Genomic Variant Prediction

Molecular Phenotype Inference

Masked Nucleotide Prediction

Use Cases

Genomics Research

Genetic Variation Analysis

Identify functional genomic regions through sequence embeddings

Significantly improves variant effect prediction accuracy compared to traditional methods

Cross-species Comparison

Analyze conserved regions using the multi-species models from the same collection

Biomedical Applications

Disease Risk Prediction

Disease association studies based on whole-genome sequences

🚀 nucleotide-transformer-2.5b-1000g model

The Nucleotide Transformers are pre-trained on diverse DNA sequences from whole genomes, offering highly accurate molecular phenotype prediction.

🚀 Quick Start

The Nucleotide Transformers are a collection of foundational language models pre-trained on DNA sequences from whole genomes. Unlike other methods, these models leverage DNA sequences from over 3,200 diverse human genomes and 850 genomes from various species. In the authors' evaluation, they predict molecular phenotypes more accurately than existing methods.

nucleotide-transformer-2.5b-1000g is part of this collection. It is a 2.5B-parameter transformer pre-trained on 3,202 genetically diverse human genomes and is available in both TensorFlow and PyTorch.

Developed by: InstaDeep, NVIDIA and TUM

✨ Features

Integrates information from over 3,200 diverse human genomes and 850 genomes from a wide range of species.
Provides highly accurate molecular phenotype prediction.
Available in both TensorFlow and PyTorch.

📦 Installation

Until its next release, the transformers library needs to be installed from source with the following command in order to use the models:

pip install --upgrade git+https://github.com/huggingface/transformers.git

💻 Usage Examples

Basic Usage

The snippet below retrieves both logits and embeddings from dummy DNA sequences.

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Import the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-2.5b-1000g")
model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-2.5b-1000g")

# Choose the length to which the input sequences are padded. By default, the 
# model max length is chosen, but feel free to decrease it as the time taken to 
# obtain the embeddings increases significantly with it.
max_length = tokenizer.model_max_length

# Create a dummy dna sequence and tokenize it
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length = max_length)["input_ids"]

# Compute the embeddings
attention_mask = tokens_ids != tokenizer.pad_token_id
torch_outs = model(
    tokens_ids,
    attention_mask=attention_mask,
    encoder_attention_mask=attention_mask,
    output_hidden_states=True
)

# Compute sequence embeddings from the last hidden layer.
# Keep them as a torch tensor so the masked average below stays in torch.
embeddings = torch_outs['hidden_states'][-1].detach()
print(f"Embeddings shape: {embeddings.shape}")
print(f"Embeddings per token: {embeddings.numpy()}")

# Add embed dimension axis so the mask broadcasts over the embedding dimension
attention_mask = torch.unsqueeze(attention_mask, dim=-1)

# Compute mean embeddings per sequence, ignoring padding positions
mean_sequence_embeddings = torch.sum(attention_mask*embeddings, dim=-2)/torch.sum(attention_mask, dim=1)
print(f"Mean sequence embeddings: {mean_sequence_embeddings}")
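Because runtime grows with the padded length, it can help to pad to roughly what the batch needs rather than to the model maximum. The helper below is a rough sketch, not part of the transformers library: `estimate_token_count` is a hypothetical name, and it assumes the tokenizer emits one <CLS> token, one token per full 6-mer, and one token per leftover nucleotide, with no N characters in the input.

```python
def estimate_token_count(sequence: str, k: int = 6) -> int:
    """Estimate the token count of a DNA sequence.

    Assumption (not taken from the library): 1 <CLS> token, one token per
    full k-mer, and one token per leftover nucleotide at the end.
    """
    full_kmers, leftover = divmod(len(sequence), k)
    return 1 + full_kmers + leftover


sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]

# Pad to the longest estimated sequence in the batch
padded_length = max(estimate_token_count(s) for s in sequences)
print(f"Estimated padded length for this batch: {padded_length}")
```

The result could be passed as `max_length` in the tokenizer call above, capped at `tokenizer.model_max_length`. The tokenizer's own `padding="longest"` option pads to the longest sequence in the batch without needing any estimate.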

📚 Documentation

Model Sources

Repository: Nucleotide Transformer
Paper: The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics

Training data

The nucleotide-transformer-2.5b-1000g model was pre-trained on 3,202 genetically diverse human genomes from 27 geographically structured populations of African, American, East Asian, and European ancestry, taken from the 1000 Genomes (1000G) project. This diversity lets the dataset encode a broader representation of human genetic variation. The phased version of the 1000G data was used, with a total of 125M mutations (111M SNPs and 14M indels). The dataset contains 19,212 B nucleotides, resulting in roughly 3,202 B tokens.
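The figures above are consistent with one another, which a few lines of arithmetic can confirm. This is only a sanity check: it assumes nearly every token covers six nucleotides, and the variable names are illustrative.

```python
# All counts are taken from the paragraph above.
nucleotides_b = 19_212   # dataset size, in billions of nucleotides
genomes = 3_202          # number of human genomes
kmer = 6                 # nucleotides per token (assumes almost all tokens are 6-mers)
snps_m, indels_m = 111, 14  # mutations, in millions

tokens_b = nucleotides_b / kmer          # billions of tokens
per_genome_b = nucleotides_b / genomes   # billions of nucleotides per genome
mutations_m = snps_m + indels_m          # total mutations, in millions

print(f"Tokens: ~{tokens_b:,.0f} B")
print(f"Nucleotides per genome: ~{per_genome_b:.0f} B")
print(f"Mutations: {mutations_m} M")
```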

Training procedure

Preprocessing

The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer, which tokenizes sequences as 6-mers when possible and otherwise tokenizes each nucleotide separately, as described in the Tokenization section of the associated repository. This tokenizer has a vocabulary size of 4105. The inputs of the model are then of the form:

<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>

The tokenized sequences have a maximum length of 1,000 tokens.
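The 6-mer rule with single-nucleotide fallback can be illustrated with a short sketch. This is not the library's tokenizer: `kmer_tokenize` is a hypothetical function that assumes a greedy left-to-right split and falls back to one nucleotide at a time whenever a full 6-mer of A, C, G, T is not available. The real tokenizer may treat N characters and special tokens differently.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Greedy left-to-right k-mer split with single-nucleotide fallback.

    Illustrative only; the actual Nucleotide Transformer tokenizer may differ.
    """
    valid = set("ACGT")
    tokens = ["<CLS>"]
    i = 0
    while i < len(sequence):
        chunk = sequence[i:i + k]
        if len(chunk) == k and set(chunk) <= valid:
            # A full k-mer made only of A, C, G, T becomes one token
            tokens.append(chunk)
            i += k
        else:
            # Otherwise emit a single nucleotide and move one position
            tokens.append(sequence[i])
            i += 1
    return tokens


print(kmer_tokenize("ACGTGTACGTGC"))
print(kmer_tokenize("ACGTGTAC"))
```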

The masking procedure used is the standard one for BERT-style training:

15% of the tokens are masked.
In 80% of the cases, the masked tokens are replaced by [MASK].
In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.
In the remaining 10% of cases, the masked tokens are left as is.
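The masking rule above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' training code: `bert_style_mask` is a hypothetical name, the token ids used for [MASK] and <CLS> are placeholders, and -100 as the "ignore" label follows a common convention rather than anything stated here.

```python
import random

IGNORE_INDEX = -100  # assumed convention for positions excluded from the loss


def bert_style_mask(token_ids, vocab_size, mask_id, special_ids=frozenset(),
                    mask_prob=0.15, seed=None):
    """Return (inputs, labels) after BERT-style 80/10/10 masking."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for pos, tok in enumerate(token_ids):
        # Select about 15% of non-special tokens for prediction
        if tok in special_ids or rng.random() >= mask_prob:
            continue
        labels[pos] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = mask_id              # 80%: replace with [MASK]
        elif roll < 0.9:
            # 10%: random token guaranteed to differ from the original.
            # Simplification: special ids are not excluded from the draw.
            new = rng.randrange(vocab_size - 1)
            inputs[pos] = new + 1 if new >= tok else new
        # remaining 10%: leave the token unchanged
    return inputs, labels


# Placeholder ids: 2 for [MASK] and 3 for <CLS> are assumptions for this demo
example = [3, 120, 457, 2048, 3999, 77]
masked, labels = bert_style_mask(example, vocab_size=4105, mask_id=2,
                                 special_ids={3}, seed=0)
print(masked)
print(labels)
```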

Pretraining

The model was trained on 128 A100 80GB GPUs on 300B tokens, with an effective batch size of 1M tokens. The sequence length used was 1,000 tokens. The Adam optimizer [38] was used with a learning rate schedule and standard values for the exponential decay rates and epsilon constant: β1 = 0.9, β2 = 0.999 and ε = 1e-8. During a first warmup period, the learning rate was increased linearly from 5e-5 to 1e-4 over 16k steps, then decreased following a square root decay until the end of training.
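The schedule described above can be written as a small function. The warmup values come from the text; the exact form of the square root decay is not given, so the sketch assumes the common inverse-square-root form anchored at the end of warmup. `learning_rate` is a hypothetical name.

```python
import math


def learning_rate(step: int, warmup_steps: int = 16_000,
                  lr_start: float = 5e-5, lr_peak: float = 1e-4) -> float:
    """Linear warmup from lr_start to lr_peak, then square root decay.

    Assumption: the decay is lr_peak * sqrt(warmup_steps / step).
    """
    if step < warmup_steps:
        return lr_start + (lr_peak - lr_start) * step / warmup_steps
    return lr_peak * math.sqrt(warmup_steps / step)


# 300B tokens at 1M tokens per batch implies about 300k optimizer steps
total_steps = 300_000_000_000 // 1_000_000

for step in (0, 8_000, 16_000, 64_000, total_steps):
    print(f"step {step:>7}: lr = {learning_rate(step):.2e}")
```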

📄 License

This model is licensed under cc-by-nc-sa-4.0.

🔧 Technical Details

BibTeX entry and citation info

@article{dalla2023nucleotide,
  title={The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics},
  author={Dalla-Torre, Hugo and Gonzalez, Liam and Mendoza Revilla, Javier and Lopez Carranza, Nicolas and Henryk Grywaczewski, Adam and Oteri, Francesco and Dallago, Christian and Trop, Evan and Sirelkhatim, Hassan and Richard, Guillaume and others},
  journal={bioRxiv},
  pages={2023--01},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
