🚀 nucleotide-transformer-v2-50m-multi-species
The Nucleotide Transformers are a set of foundational language models pre-trained on DNA sequences from whole genomes. Unlike other approaches, our models not only integrate information from single reference genomes but also leverage DNA sequences from over 3,200 diverse human genomes and 850 genomes from a wide variety of species, including model and non-model organisms. Through comprehensive evaluation, we demonstrate that these large models offer highly accurate molecular phenotype prediction compared to existing methods.
Part of this collection is the nucleotide-transformer-v2-50m-multi-species, a 50M-parameter transformer pre-trained on 850 genomes from a wide range of species, including model and non-model organisms.
Developed by: InstaDeep, NVIDIA and TUM
🚀 Quick Start
Model Sources
How to use
Until its next release, the transformers library needs to be installed from source with the following command in order to use the models:
pip install --upgrade git+https://github.com/huggingface/transformers.git
💻 Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-50m-multi-species", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-50m-multi-species", trust_remote_code=True)

# Length to which the input sequences are padded. The model max length is used
# here; a smaller value reduces the compute needed to obtain the embeddings.
max_length = tokenizer.model_max_length

# Create a dummy batch of DNA sequences and tokenize it
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

# Run the model, masking out the padding tokens
attention_mask = tokens_ids != tokenizer.pad_token_id
torch_outs = model(
    tokens_ids,
    attention_mask=attention_mask,
    encoder_attention_mask=attention_mask,
    output_hidden_states=True
)

# Per-token embeddings from the last hidden layer, kept as a torch tensor so they can be combined with the mask below
embeddings = torch_outs['hidden_states'][-1].detach()
print(f"Embeddings shape: {embeddings.shape}")
print(f"Embeddings per token: {embeddings}")

# Add an embedding dimension to the mask, then average over the non-padding tokens of each sequence
attention_mask = torch.unsqueeze(attention_mask, dim=-1)
mean_sequence_embeddings = torch.sum(attention_mask*embeddings, axis=-2)/torch.sum(attention_mask, axis=1)
print(f"Mean sequence embeddings: {mean_sequence_embeddings}")
📚 Documentation
Model Information
| Property | Details |
|---|---|
| Model Type | nucleotide-transformer-v2-50m-multi-species, a 50M-parameter transformer pre-trained on 850 genomes from a wide range of species |
| Training Data | 850 genomes downloaded from NCBI, excluding plants and viruses. The data represents a total of 174B nucleotides (roughly 29B tokens) and is released as a HuggingFace dataset (InstaDeepAI/multi_species_genome) |
| Datasets | InstaDeepAI/multi_species_genome, InstaDeepAI/nucleotide_transformer_downstream_tasks |
| Tags | DNA, biology, genomics |
Training procedure
Preprocessing
The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer, which splits sequences into 6-mers when possible and otherwise tokenizes each nucleotide separately, as described in the Tokenization section of the associated repository. This tokenizer has a vocabulary size of 4105. The inputs of the model are then of the form:
<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
The tokenized sequences have a maximum length of 1,000 tokens.
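To make the 6-mer scheme concrete, here is a minimal sketch of a greedy left-to-right tokenizer with a single-nucleotide fallback. It is an illustration only, not the Nucleotide Transformer Tokenizer itself: the function name `tokenize_6mers` is hypothetical, and the handling of edge cases (for example sequences containing `N`) is an assumption that may differ from the real tokenizer. For actual use, load the tokenizer through `AutoTokenizer` as shown in the usage example.

```python
NUCLEOTIDES = set("ACGT")
K = 6  # k-mer size described in the text


def tokenize_6mers(sequence, k=K):
    """Greedy k-mer tokenization with single-nucleotide fallback (illustrative)."""
    tokens = ["<CLS>"]
    i = 0
    while i < len(sequence):
        chunk = sequence[i:i + k]
        if len(chunk) == k and set(chunk) <= NUCLEOTIDES:
            # A full k-mer made only of A, C, G, T becomes one token.
            tokens.append(f"<{chunk}>")
            i += k
        else:
            # Assumed fallback: emit the current nucleotide on its own.
            tokens.append(f"<{sequence[i]}>")
            i += 1
    return tokens


print(" ".join(tokenize_6mers("ACGTGTACGTGCACGGACGACTAGTCAGCA")))
# prints: <CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
```

There are 4^6 = 4096 possible 6-mers over A, C, G and T, which accounts for most of the 4105-entry vocabulary; the remaining entries presumably cover single nucleotides and special tokens.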
The masking procedure used is the standard one for BERT-style training:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by [MASK].
- In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.
- In the 10% remaining cases, the masked tokens are left as is.
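The masking rules above can be sketched in a few lines of plain Python. This is a hedged illustration, not the training code used for the model: the function name `bert_style_mask`, the `IGNORE_INDEX` label value, and the per-token Bernoulli selection (rather than selecting exactly 15% of positions) are all assumptions.

```python
import random

IGNORE_INDEX = -100  # assumed label for positions excluded from the loss


def bert_style_mask(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=None):
    """Return (inputs, labels) after BERT-style 80/10/10 masking (illustrative)."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, original in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = original  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            # 10%: replace with a random token different from the original.
            # Drawing from vocab_size - 1 ids and shifting skips `original`.
            candidate = rng.randrange(vocab_size - 1)
            inputs[i] = candidate if candidate < original else candidate + 1
        # remaining 10%: leave the token unchanged
    return inputs, labels
```

Note that this sketch draws random replacements from the whole vocabulary, special tokens included; a real pipeline would likely restrict the draw.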
Pretraining
The model was trained on 8 A100 80GB GPUs for 300B tokens, with an effective batch size of 1M tokens. The sequence length used was 1,000 tokens. The Adam optimizer [38] was used with a learning rate schedule and standard values for the exponential decay rates and epsilon constant: β1 = 0.9, β2 = 0.999 and ε = 1e-8. During an initial warmup period, the learning rate was increased linearly from 5e-5 to 1e-4 over 16k steps, after which it decreased following a square root decay until the end of training.
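As a rough sketch of the schedule described above, the function below implements a linear warmup followed by a square root decay. The exact decay formula is not given in this card, so the form `peak * sqrt(warmup_steps / step)` is an assumption, as are the constant and function names. The step count treats "1M tokens" as 10^6.

```python
import math

WARMUP_STEPS = 16_000
INITIAL_LR = 5e-5
PEAK_LR = 1e-4


def learning_rate(step):
    """Linear warmup, then an assumed inverse square root decay (illustrative)."""
    if step < WARMUP_STEPS:
        fraction = step / WARMUP_STEPS
        return INITIAL_LR + (PEAK_LR - INITIAL_LR) * fraction
    # Assumed decay form: continuous with the warmup at step == WARMUP_STEPS.
    return PEAK_LR * math.sqrt(WARMUP_STEPS / step)


# 300B training tokens at 1M tokens per batch gives the number of optimizer steps.
total_steps = 300_000_000_000 // 1_000_000

for step in (0, 8_000, 16_000, 64_000, total_steps):
    print(step, learning_rate(step))
```

Under these assumptions training runs for about 300,000 steps, so the warmup covers roughly the first 5% of training.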
Architecture
The model belongs to the second generation of nucleotide transformers. The architectural changes consist of the use of rotary positional embeddings instead of learned ones, as well as the introduction of Gated Linear Units.
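For readers unfamiliar with these two components, the sketch below shows the general idea in plain Python. It is not the model's implementation: the helper names, the pairing of consecutive dimensions, the base of 10000, and the sigmoid gate are assumptions (this card does not say which rotary layout or which GLU variant is used).

```python
import math


def rotary(vector, position, base=10000.0):
    """Rotate consecutive pairs of `vector` by position-dependent angles."""
    dim = len(vector)
    assert dim % 2 == 0, "rotary embeddings need an even dimension"
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_linear_unit(values, gates):
    """Scale each value by sigmoid(gate); both are projections of the same input."""
    return [v / (1.0 + math.exp(-g)) for v, g in zip(values, gates)]


q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.25]
# The attention score depends only on the relative offset (2 in both cases).
print(dot(rotary(q, 5), rotary(k, 3)), dot(rotary(q, 12), rotary(k, 10)))
```

The two printed scores agree up to floating-point error, which illustrates why rotary embeddings encode relative rather than absolute position and need no learned position table.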
BibTeX entry and citation info
@article{dalla2023nucleotide,
title={The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics},
author={Dalla-Torre, Hugo and Gonzalez, Liam and Mendoza Revilla, Javier and Lopez Carranza, Nicolas and Henryk Grywaczewski, Adam and Oteri, Francesco and Dallago, Christian and Trop, Evan and Sirelkhatim, Hassan and Richard, Guillaume and others},
journal={bioRxiv},
pages={2023--01},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}
📄 License
This model is licensed under cc-by-nc-sa-4.0.