🚀 Nucleotide Transformer V2 50M Multi-Species
The Nucleotide Transformers are a set of foundational language models pre-trained on DNA sequences from whole genomes. Unlike other methods, these models not only incorporate information from single reference genomes but also utilize DNA sequences from over 3,200 diverse human genomes and 850 genomes from a wide variety of species, including model and non-model organisms. Through comprehensive evaluation, we demonstrate that these large models offer highly accurate molecular phenotype prediction compared to existing approaches.
Part of this collection is the nucleotide-transformer-v2-50m-3mer-multi-species, a 50M-parameter transformer pre-trained on a collection of 850 genomes from a wide range of species, including model and non-model organisms.
This model was developed as part of an effort to assess the capabilities of genomic language models on protein downstream tasks. In that work, 3-mer tokenization was considered as a potential architectural change to enhance fine-grained downstream protein prediction.
Developed by: InstaDeep
🚀 Quick Start
Model Sources
How to use
Until its next release, the `transformers` library needs to be installed from source with the following command to use the models:

```bash
pip install --upgrade git+https://github.com/huggingface/transformers.git
```
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Import the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-50m-3mer-multi-species", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-50m-3mer-multi-species", trust_remote_code=True)

# Choose the length to which the input sequences are padded (the model maximum by default)
max_length = tokenizer.model_max_length

# Create dummy DNA sequences and tokenize them
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

# Compute the embeddings
attention_mask = tokens_ids != tokenizer.pad_token_id
torch_outs = model(
    tokens_ids,
    attention_mask=attention_mask,
    encoder_attention_mask=attention_mask,
    output_hidden_states=True,
)

# Per-token embeddings from the last hidden layer (kept as a torch tensor so the
# masked mean below stays in torch rather than mixing torch and NumPy)
embeddings = torch_outs["hidden_states"][-1].detach()
print(f"Embeddings shape: {embeddings.shape}")
print(f"Embeddings per token: {embeddings}")

# Add an embedding-dimension axis to the attention mask
attention_mask = torch.unsqueeze(attention_mask, dim=-1)

# Mean embedding per sequence, ignoring padding tokens
mean_sequence_embeddings = torch.sum(attention_mask * embeddings, dim=-2) / torch.sum(attention_mask, dim=1)
print(f"Mean sequence embeddings: {mean_sequence_embeddings}")
```
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Nucleotide Transformer V2 50M Multi-Species |
| Training Data | 850 genomes from a wide range of species (excluding plants and viruses), downloaded from NCBI, representing about 174B nucleotides (roughly 29B tokens). The data is released as a HuggingFace dataset here. |
Training data
The nucleotide-transformer-v2-50m-3mer-multi-species model was pretrained on a total of 850 genomes downloaded from NCBI. Plants and viruses are not included in these genomes, as their regulatory elements differ from those of interest in the paper's tasks. Some well-studied model organisms were included in the genome collection.
Training procedure
Preprocessing
The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer. It tokenizes sequences as 6-mers when possible; otherwise, it tokenizes each nucleotide separately, as described in the Tokenization section of the associated repository. The tokenizer has a vocabulary size of 4105. The model inputs are of the form:

`<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>`
The tokenized sequence has a maximum length of 1,000.
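To see this behaviour concretely, the tokenizer can be inspected directly. This is an illustrative sketch only; the exact token boundaries and special tokens depend on the tokenizer implementation shipped with the checkpoint.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "InstaDeepAI/nucleotide-transformer-v2-50m-3mer-multi-species",
    trust_remote_code=True,
)

# A sequence whose length is not a multiple of the k-mer size, so the tokenizer
# has to fall back to single-nucleotide tokens for part of it
sequence = "ATTCCGATTCCGATTCCGAT"
ids = tokenizer.batch_encode_plus([sequence], return_tensors="pt")["input_ids"][0]

# The first token should be <CLS>, followed by k-mer and single-nucleotide tokens
print(tokenizer.convert_ids_to_tokens(ids.tolist()))
print(f"Vocabulary size: {tokenizer.vocab_size}")
```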
The masking procedure follows the standard BERT-style recipe (a code sketch follows the list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token different from the original one.
- In the remaining 10% of the cases, the masked tokens are left unchanged.
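As a rough illustration of that 80/10/10 split (not the actual pretraining code, and simplified in that the random replacement is not forced to differ from the original token), the corruption step can be sketched as follows:

```python
import torch

def bert_style_mask(token_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                    special_token_ids: set, mask_prob: float = 0.15):
    """Illustrative BERT-style masking: select 15% of the (non-special) tokens;
    of those, 80% become [MASK], 10% become a random token, 10% stay unchanged."""
    labels = token_ids.clone()
    corrupted = token_ids.clone()

    # Select ~15% of the non-special positions as prediction targets
    candidate = torch.full(token_ids.shape, mask_prob)
    for sid in special_token_ids:
        candidate[token_ids == sid] = 0.0
    selected = torch.bernoulli(candidate).bool()
    labels[~selected] = -100  # positions ignored by the loss

    # 80% of the selected positions -> [MASK]
    to_mask = torch.bernoulli(torch.full(token_ids.shape, 0.8)).bool() & selected
    corrupted[to_mask] = mask_token_id

    # Half of the remaining 20% -> a random token (i.e. 10% overall)
    to_random = torch.bernoulli(torch.full(token_ids.shape, 0.5)).bool() & selected & ~to_mask
    corrupted[to_random] = torch.randint(vocab_size, token_ids.shape)[to_random]

    # The rest of the selected positions keep their original token
    return corrupted, labels
```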
Pretraining
The model was trained with 64 TPUv4s on 300B tokens, with an effective batch size of 1M tokens. The sequence length used was 1,000 tokens. The Adam optimizer was used with a learning-rate schedule and standard values for the exponential decay rates and epsilon constant (β1 = 0.9, β2 = 0.999, ε = 1e-8). During an initial warm-up period, the learning rate was increased linearly from 5e-5 to 1e-4 over 16k steps, after which it decreased following a square-root decay until the end of training.
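For intuition, the schedule described above can be written as a function of the training step. The warm-up endpoints and length come from the paragraph above; the exact parameterisation of the square-root decay is an assumption (chosen here so the curve is continuous at the end of warm-up).

```python
import math

def nt_learning_rate(step: int,
                     init_lr: float = 5e-5,
                     peak_lr: float = 1e-4,
                     warmup_steps: int = 16_000) -> float:
    """Sketch of the described schedule: linear warm-up from init_lr to peak_lr
    over warmup_steps, then a square-root decay (assumed form)."""
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)

# Learning rate at a few points in training
for s in (1_000, 16_000, 64_000, 256_000):
    print(s, nt_learning_rate(s))
```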
Architecture
The model belongs to the second generation of nucleotide transformers. The architectural changes include using rotary positional embeddings instead of learned ones and introducing Gated Linear Units.
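As a rough illustration of these two changes (not the model's actual implementation; the dimensions, gate activation, and rotary formulation below are assumptions), a gated linear unit feed-forward block and the application of rotary position embeddings can be sketched as:

```python
import torch
import torch.nn as nn

class GLUFeedForward(nn.Module):
    """Illustrative gated linear unit feed-forward block. The sigmoid gate is the
    original GLU formulation; variants use other activations (e.g. SwiGLU)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise product of a gated branch and a linear branch
        return self.down(torch.sigmoid(self.gate(x)) * self.up(x))

def apply_rotary(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (batch, seq_len, dim):
    each channel pair is rotated by an angle that depends on the position."""
    _, seq_len, dim = x.shape
    half = dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = torch.cos(angles), torch.sin(angles)   # (seq_len, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

In the model itself, rotary embeddings are applied to the query and key vectors inside self-attention rather than to the hidden states directly.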
BibTeX entry and citation info
```bibtex
@article{boshar2024glmsforproteins,
  title={Are Genomic Language Models All You Need? Exploring Genomic Language Models on Protein Downstream Tasks},
  author={Boshar, Sam and Trop, Evan and de Almeida, Bernardo and Copoiu, Liviu and Pierrot, Thomas},
  journal={bioRxiv},
  pages={2024--01},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
```
📄 License
The model is licensed under CC-BY-NC-SA-4.0.