🚀 nucleotide-transformer-2.5b-multi-species model
The Nucleotide Transformers are a set of foundational language models pre-trained on DNA sequences from whole genomes. Unlike other approaches, our models not only integrate information from single reference genomes, but also leverage DNA sequences from over 3,200 diverse human genomes and 850 genomes from a wide variety of species, including model and non-model organisms. Through rigorous and extensive evaluation, we demonstrate that these large models offer extremely accurate molecular phenotype prediction compared to existing methods.
Part of this collection is the nucleotide-transformer-2.5b-multi-species, a 2.5B-parameter transformer pre-trained on a collection of 850 genomes from a wide range of species, including model and non-model organisms. The model is available in both TensorFlow and PyTorch.
Developed by: InstaDeep, NVIDIA and TUM
🚀 Quick Start
Model Sources
How to use
See the 📦 Installation section below for how to install the transformers library from source, and the 💻 Usage Examples section for a minimal inference snippet.
✨ Features
The Nucleotide Transformer models integrate information from diverse genomes, including over 3,200 human genomes and 850 genomes from a wide range of species, and offer highly accurate molecular phenotype prediction. The nucleotide-transformer-2.5b-multi-species model is pre-trained on a collection of 850 genomes and is available in both TensorFlow and PyTorch.
📦 Installation
Until its next release, the transformers library needs to be installed from source with the following command in order to use the models:

```bash
pip install --upgrade git+https://github.com/huggingface/transformers.git
```
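A quick sanity check after installing (a minimal illustration, not part of the original card) is to print the installed version and load the model's tokenizer:

```python
# Confirm that transformers is importable and report the installed version.
import transformers
print(transformers.__version__)

# Loading the tokenizer is a lightweight check that the model repository is reachable.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-2.5b-multi-species")
print(type(tokenizer).__name__)
```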
💻 Usage Examples
Basic Usage
A small snippet of code is given here in order to retrieve both logits and embeddings from dummy DNA sequences.
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Import the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-2.5b-multi-species")
model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-2.5b-multi-species")

# Create dummy DNA sequences and tokenize them, padding to the model's maximum length
max_length = tokenizer.model_max_length
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

# Build the attention mask (True for real tokens, False for padding)
attention_mask = tokens_ids != tokenizer.pad_token_id

# Forward pass, keeping the hidden states so embeddings can be extracted
torch_outs = model(
    tokens_ids,
    attention_mask=attention_mask,
    encoder_attention_mask=attention_mask,
    output_hidden_states=True,
)

# Per-token embeddings from the last hidden layer
embeddings = torch_outs["hidden_states"][-1].detach()
print(f"Embeddings shape: {embeddings.shape}")
print(f"Embeddings per token: {embeddings}")

# Add an embedding-dimension axis to the mask, then average over the sequence
# dimension while ignoring padding positions
attention_mask = torch.unsqueeze(attention_mask, dim=-1)
mean_sequence_embeddings = torch.sum(attention_mask * embeddings, dim=-2) / torch.sum(attention_mask, dim=1)
print(f"Mean sequence embeddings: {mean_sequence_embeddings}")
```
📚 Documentation
Training data
The nucleotide-transformer-2.5b-multi-species model was pretrained on a total of 850 genomes downloaded from NCBI. Plants and viruses are not included in these genomes, as their regulatory elements differ from those of interest in the paper's tasks. Some heavily studied model organisms were picked to be included in the collection of genomes, which represents a total of 174B nucleotides, i.e. roughly 29B tokens. The data has been released as the Hugging Face dataset InstaDeepAI/multi_species_genomes.
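For reference, the corpus can be streamed with the datasets library as sketched below; the split name and record fields are assumptions, so check the dataset card for the exact schema:

```python
from datasets import load_dataset

# Stream the multi-species genome corpus released with the model.
# NOTE: the "train" split and the record fields are assumptions here;
# see the dataset card for the exact schema.
ds = load_dataset("InstaDeepAI/multi_species_genomes", split="train", streaming=True)
first_record = next(iter(ds))
print(first_record.keys())
```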
Training procedure
Preprocessing
The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer, which tokenizes sequences as 6-mers when possible and otherwise tokenizes each nucleotide separately, as described in the Tokenization section of the associated repository. This tokenizer has a vocabulary size of 4105. The inputs of the model are then of the form:
<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
The tokenized sequences have a maximum length of 1,000 tokens.
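As an illustration (not part of the original card), the tokenizer can be applied directly to see the 6-mer splitting, with leftover nucleotides tokenized one by one:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-2.5b-multi-species")

# 20 nucleotides: three 6-mers followed by two single-nucleotide tokens
print(tokenizer.tokenize("ATTCCGATTCCGATTCCGAT"))
# Expected to look like ['ATTCCG', 'ATTCCG', 'ATTCCG', 'A', 'T'] (illustrative)
```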
The masking procedure used is the standard one for BERT-style training:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by [MASK].
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the remaining 10% of cases, the masked tokens are left as is.
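This 80% / 10% / 10% scheme matches the default behaviour of DataCollatorForLanguageModeling in transformers, so a comparable masking step can be sketched as follows (an illustration of the procedure, not the authors' pre-training pipeline):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-2.5b-multi-species")

# mlm_probability=0.15 matches the 15% masking rate; the 80/10/10
# replacement scheme is applied internally by the collator.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

samples = [tokenizer("ATTCCGATTCCGATTCCG")]
batch = collator(samples)
print(batch["input_ids"])  # some positions replaced by the mask token id
print(batch["labels"])     # -100 everywhere except the masked positions
```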
Pretraining
The model was trained with 128 A100 80GB GPUs on 300B tokens, with an effective batch size of 1M tokens. The sequence length used was 1,000 tokens. The Adam optimizer was used with a learning rate schedule and standard values for the exponential decay rates and epsilon constant: β1 = 0.9, β2 = 0.999 and ε = 1e-8. During a first warmup period, the learning rate was increased linearly from 5e-5 to 1e-4 over 16k steps, before decreasing following a square-root decay until the end of training.
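For intuition, the schedule described above can be written as a small function; this is a sketch of the description, not the authors' training code, and the post-warmup anchoring is an assumption:

```python
def learning_rate(step: int,
                  warmup_steps: int = 16_000,
                  lr_min: float = 5e-5,
                  lr_max: float = 1e-4) -> float:
    """Linear warmup from lr_min to lr_max, then square-root decay.

    Sketch of the schedule described in the card; the decay is anchored
    so that the rate equals lr_max at the end of warmup (an assumption).
    """
    if step < warmup_steps:
        # Linear interpolation between lr_min and lr_max during warmup
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    # Square-root decay after warmup
    return lr_max * (warmup_steps / step) ** 0.5

print(learning_rate(0), learning_rate(16_000), learning_rate(64_000))
# 5e-05 0.0001 5e-05
```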
🔧 Technical Details
The Nucleotide Transformer models are pre-trained on a large collection of genomes from diverse species. Sequences are tokenized with the Nucleotide Transformer tokenizer, which has a vocabulary size of 4105, and masked following the standard BERT-style procedure. The model was trained with 128 A100 80GB GPUs on 300B tokens, using the Adam optimizer and the learning rate schedule described above.
📄 License
This model is released under the cc-by-nc-sa-4.0 license.
BibTeX entry and citation info
```bibtex
@article{dalla2023nucleotide,
  title={The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics},
  author={Dalla-Torre, Hugo and Gonzalez, Liam and Mendoza Revilla, Javier and Lopez Carranza, Nicolas and Henryk Grywaczewski, Adam and Oteri, Francesco and Dallago, Christian and Trop, Evan and Sirelkhatim, Hassan and Richard, Guillaume and others},
  journal={bioRxiv},
  pages={2023--01},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
```
| Property | Details |
|----------|---------|
| Model Type | nucleotide-transformer-2.5b-multi-species |
| Training Data | 850 genomes downloaded from NCBI, representing 174B nucleotides (roughly 29B tokens). Data released as InstaDeepAI/multi_species_genomes |