Nucleotide Transformer 500m Human Ref
nucleotide-transformer-500m-human-ref model
The Nucleotide Transformers are a set of foundational language models pre-trained on DNA sequences from whole genomes. Unlike other methods, our models not only incorporate information from single reference genomes but also utilize DNA sequences from over 3,200 diverse human genomes and 850 genomes from a wide variety of species, including model and non-model organisms. Through comprehensive and rigorous evaluation, we demonstrate that these large models offer highly accurate molecular phenotype prediction compared to existing approaches.
Part of this collection is the nucleotide-transformer-500m-human-ref, a 500M-parameter transformer pre-trained on the human reference genome. The model is available in both TensorFlow and PyTorch.
Developed by: InstaDeep, NVIDIA and TUM
Quick Start
Model Sources
- Repository: Nucleotide Transformer
- Paper: The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
How to use
Until its next release, the transformers library needs to be installed from source with the following command in order to use the models:
pip install --upgrade git+https://github.com/huggingface/transformers.git
The following is a code snippet to retrieve both logits and embeddings from a dummy DNA sequence:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
# Import the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")
model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")
# Choose the length to which the input sequences are padded. By default, the
# model max length is chosen, but feel free to decrease it as the time taken to
# obtain the embeddings increases significantly with it.
max_length = tokenizer.model_max_length
# Create a dummy DNA sequence and tokenize it
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]
# Compute the embeddings
attention_mask = tokens_ids != tokenizer.pad_token_id
torch_outs = model(
    tokens_ids,
    attention_mask=attention_mask,
    encoder_attention_mask=attention_mask,
    output_hidden_states=True
)
# Compute sequences embeddings
embeddings = torch_outs['hidden_states'][-1].detach().numpy()
print(f"Embeddings shape: {embeddings.shape}")
print(f"Embeddings per token: {embeddings}")
# Add embed dimension axis
attention_mask = torch.unsqueeze(attention_mask, dim=-1)
# Compute mean embeddings per sequence
mean_sequence_embeddings = torch.sum(attention_mask*embeddings, axis=-2)/torch.sum(attention_mask, axis=1)
print(f"Mean sequence embeddings: {mean_sequence_embeddings}")
Features
- The Nucleotide Transformer collection integrates DNA sequences from a large number of diverse human genomes and genomes of various species; this 500M-parameter checkpoint is pre-trained on the human reference genome.
- Available in both TensorFlow and PyTorch.
- Provides highly accurate molecular phenotype prediction.
Documentation
Training data
The nucleotide-transformer-500m-human-ref model was pre-trained on the GRCh38 human reference genome, available as the Hugging Face dataset InstaDeepAI/human_reference_genome. The corpus consists of 3B nucleotides, approximately 500M 6-mer tokens.
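For reference, the pre-training corpus can be inspected directly from the Hugging Face Hub. The sketch below is only illustrative and assumes the datasets library; the split name and streaming flag are assumptions, not part of the original training setup:
from datasets import load_dataset

# Stream the GRCh38 human reference genome corpus used for pre-training
# (the split name "train" is an assumption; adjust to the dataset's actual configuration).
dataset = load_dataset("InstaDeepAI/human_reference_genome", split="train", streaming=True)
print(next(iter(dataset)))  # first record of the corpus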
Training procedure
Preprocessing
The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer, which tokenizes sequences as 6-mers when possible and otherwise tokenizes each nucleotide separately, as described in the Tokenization section of the associated repository. The tokenizer has a vocabulary size of 4,105. The model inputs are of the form:
<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
The tokenized sequence has a maximum length of 1,000.
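To see this tokenization in action, one can inspect the tokens produced for a short sequence. The snippet below is a small illustration (the example sequence is arbitrary):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")

# 20 nucleotides: the first 18 are grouped into three 6-mers and the remaining
# two fall back to single-nucleotide tokens.
example = "ATTCCGATTCCGATTCCGAT"
token_ids = tokenizer(example)["input_ids"]
print(tokenizer.convert_ids_to_tokens(token_ids))
# Expect a CLS token followed by 'ATTCCG' three times, then 'A' and 'T'.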
The masking procedure for BERT-style training is as follows:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by [MASK].
- In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.
- In the remaining 10% of cases, the masked tokens are left unchanged.
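This is the standard BERT corruption scheme. As an equivalent illustration only (not the original training pipeline), the same 15% / 80-10-10 behaviour can be reproduced with the transformers data collator, assuming the tokenizer exposes the [MASK] token used during pre-training:
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")

# DataCollatorForLanguageModeling selects 15% of tokens, replaces 80% of them with the
# mask token, 10% with a random token and leaves 10% unchanged.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
features = [tokenizer("ATTCCGATTCCGATTCCG"), tokenizer("ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT")]
batch = collator(features)
print(batch["input_ids"])  # corrupted inputs
print(batch["labels"])     # original ids at corrupted positions, -100 elsewhere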
Pretraining
The model was trained on 8 A100 80GB GPUs for 300B tokens, with an effective batch size of 1M tokens and a sequence length of 1,000 tokens. The Adam optimizer was used with a learning-rate schedule and the standard values for the exponential decay rates and epsilon constant: β1 = 0.9, β2 = 0.999 and ε = 1e-8. During a first warm-up period, the learning rate was increased linearly from 5e-5 to 1e-4 over 16k steps, and then decreased following a square-root decay until the end of training.
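As a purely illustrative sketch of this schedule (the exact scheduler implementation is not given in this card, and the square-root decay formula below is an assumption consistent with the description above):
def learning_rate(step: int,
                  warmup_steps: int = 16_000,
                  lr_start: float = 5e-5,
                  lr_peak: float = 1e-4) -> float:
    """Linear warm-up from lr_start to lr_peak, then square-root decay."""
    if step < warmup_steps:
        return lr_start + (lr_peak - lr_start) * step / warmup_steps
    # Decay proportional to 1/sqrt(step), continuous with the end of the warm-up phase.
    return lr_peak * (warmup_steps / step) ** 0.5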
Technical Details
The Nucleotide Transformer models are pre-trained on large amounts of DNA sequence data. By leveraging sequences from multiple genomes, the collection captures more comprehensive genetic information than models restricted to a single reference. The 6-mer tokenization provides a compact representation of genomic sequences, and the BERT-style masked-language-modelling objective teaches the model the contextual structure of DNA. Training on 300B tokens with the optimizer settings described above underpins the model's downstream performance.
License
The model is released under the CC BY-NC-SA 4.0 license.
BibTeX entry and citation info
@article{dalla2023nucleotide,
title={The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics},
author={Dalla-Torre, Hugo and Gonzalez, Liam and Mendoza Revilla, Javier and Lopez Carranza, Nicolas and Henryk Grywaczewski, Adam and Oteri, Francesco and Dallago, Christian and Trop, Evan and Sirelkhatim, Hassan and Richard, Guillaume and others},
journal={bioRxiv},
pages={2023--01},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}
| Property | Details |
|---|---|
| Model Type | Transformer pre-trained on the human reference genome |
| Training Data | GRCh38 human reference genome, available as InstaDeepAI/human_reference_genome |
| Datasets | InstaDeepAI/human_reference_genome, InstaDeepAI/nucleotide_transformer_downstream_tasks |
| Tags | DNA, biology, genomics |
| Widget | text: ACCTGA |