# AgroNT: A DNA Language Model for Plant Genomics
AgroNT is a DNA language model trained primarily on edible plant genomes. It uses a transformer architecture with self-attention and masked language modeling to learn general nucleotide sequence representations from 48 plant species.
## License
This project is licensed under the CC BY-NC-SA 4.0 license.
## Datasets
- InstaDeepAI/plant-genomic-benchmark
## Tags
- biology
- genomics
- language model
- plants
## Quick Start
### Features
- Transformer Architecture: AgroNT uses the transformer architecture with self-attention, enabling it to capture complex relationships in nucleotide sequences.
- Masked Language Modeling: It employs masked language modeling to learn from highly available genotype data.
- Large Parameter Count: With 1 billion parameters and a context window of 1024 tokens, AgroNT can handle relatively long nucleotide sequences.
- 6-mer Tokenizer: A non-overlapping 6-mer tokenizer is used, where 1024 tokens correspond to approximately 6144 base pairs.
### Usage Examples
#### Basic Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load the pre-trained model and its 6-mer tokenizer from the Hugging Face Hub
model_name = 'agro-nucleotide-transformer-1b'
agro_nt_model = AutoModelForMaskedLM.from_pretrained(f'InstaDeepAI/{model_name}')
agro_nt_tokenizer = AutoTokenizer.from_pretrained(f'InstaDeepAI/{model_name}')
print(f"Loaded the {model_name} model with {agro_nt_model.num_parameters()} parameters and corresponding tokenizer.")

# Tokenize a batch of nucleotide sequences, padding to the longest sequence
sequences = ['ATATACGGCCGNC', 'GGGTATCGCTTCCGAC']
batch_tokens = agro_nt_tokenizer(sequences, padding="longest")['input_ids']
print(f"Tokenized sequences: {agro_nt_tokenizer.batch_decode(batch_tokens)}")

torch_batch_tokens = torch.tensor(batch_tokens)
attention_mask = torch_batch_tokens != agro_nt_tokenizer.pad_token_id

# Forward pass; hidden states for all layers are returned
outs = agro_nt_model(
    torch_batch_tokens,
    attention_mask=attention_mask,
    encoder_attention_mask=attention_mask,
    output_hidden_states=True
)

# Per-token embeddings from the final layer and masked-language-modeling logits
embeddings = outs['hidden_states'][-1].detach().numpy()
logits = outs['logits'].detach().numpy()
```
## Documentation
### Pre-training
#### Data
The pre-training dataset was built from the reference genomes of (mostly) edible plants in the Ensembl Plants database. It consists of approximately 10.5 million genomic sequences from 48 different species.
#### Processing
All reference genomes for each species were assembled into a single FASTA file, in which nucleotides other than A, T, C, G were replaced by N. A tokenizer was then used to convert the nucleotide sequences into tokens. The tokenizer's alphabet consists of the 4096 possible 6-mer combinations of A, T, C, G, five standalone tokens for A, T, C, G, and N, and three special tokens ([PAD], [MASK], and [CLS]), resulting in a vocabulary of 4104 tokens.
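As a quick sanity check of the vocabulary size, the following is a minimal sketch that enumerates a vocabulary of this shape; the token strings and their ordering are illustrative assumptions, not the released tokenizer's actual vocabulary file:

```python
from itertools import product

# 4096 possible 6-mers over the {A, T, C, G} alphabet
six_mers = [''.join(p) for p in product("ATCG", repeat=6)]

# Five standalone nucleotide tokens plus three special tokens
standalone = ["A", "T", "C", "G", "N"]
special = ["[PAD]", "[MASK]", "[CLS]"]

vocab = special + standalone + six_mers   # ordering is illustrative only
assert len(vocab) == 4104                 # 4096 + 5 + 3
```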
#### Tokenization example
```
nucleotide sequence: ATCCCGGNNTCGACACN
tokens: <CLS> <ATCCCG> <G> <N> <N> <TCGACA> <C> <N>
```
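The exact logic lives in the released tokenizer; the sketch below (the helper `tokenize_6mer` is hypothetical) reproduces the example above with a greedy rule: emit a 6-mer token whenever the next six characters are all A, T, C, G, and otherwise fall back to a single-nucleotide token.

```python
def tokenize_6mer(sequence: str) -> list[str]:
    """Greedy 6-mer tokenization sketch: use a 6-mer token when the next six
    characters are all A/T/C/G, otherwise emit a single-nucleotide token."""
    tokens, i = ["<CLS>"], 0
    while i < len(sequence):
        chunk = sequence[i:i + 6]
        if len(chunk) == 6 and set(chunk) <= set("ATCG"):
            tokens.append(f"<{chunk}>")
            i += 6
        else:
            tokens.append(f"<{sequence[i]}>")
            i += 1
    return tokens

print(tokenize_6mer("ATCCCGGNNTCGACACN"))
# ['<CLS>', '<ATCCCG>', '<G>', '<N>', '<N>', '<TCGACA>', '<C>', '<N>']
```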
#### Training
The MLM objective was used for self-supervised pre-training: 15% of the tokens in each input sequence are selected for masking, of which 80% are replaced with the mask token, 10% are replaced by a random token from the vocabulary, and 10% are left unchanged. The model was pre-trained with a sequence length of 1024 tokens and an effective batch size of 1.5M tokens for 315k update steps, i.e. approximately 472.5B tokens seen during training.
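To make the 80/10/10 scheme concrete, here is a minimal PyTorch sketch of such a corruption step; the function name `mask_mlm_tokens`, its interface, and the token ids used in the usage line are assumptions for illustration, not the training code used for AgroNT:

```python
import torch

def mask_mlm_tokens(input_ids, vocab_size, mask_token_id, mlm_prob=0.15):
    """Sketch of MLM corruption: select 15% of positions, then replace 80% of
    them with [MASK], 10% with a random token, and leave 10% unchanged."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100  # loss is computed only on selected positions
    # (a real implementation would also exclude special tokens such as [CLS]/[PAD])

    corrupted = input_ids.clone()

    # 80% of selected positions -> [MASK]
    masked = selected & (torch.rand(input_ids.shape) < 0.8)
    corrupted[masked] = mask_token_id

    # 10% of selected positions -> random token (half of the remaining 20%)
    random_tok = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
    corrupted[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]

    # the remaining 10% of selected positions stay unchanged
    return corrupted, labels

ids = torch.randint(8, 4104, (2, 1024))  # toy batch of token ids, illustrative only
corrupted, labels = mask_mlm_tokens(ids, vocab_size=4104, mask_token_id=2)
```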
#### Hardware
Model pre-training was carried out on Google TPU v4 accelerators, specifically a TPU v4-1024 pod slice with 512 devices, and took approximately four days.
## BibTeX entry and citation info
```bibtex
@article{mendoza2023foundational,
  title={A Foundational Large Language Model for Edible Plant Genomes},
  author={Mendoza-Revilla, Javier and Trop, Evan and Gonzalez, Liam and Roller, Masa and Dalla-Torre, Hugo and de Almeida, Bernardo P and Richard, Guillaume and Caton, Jonathan and Lopez Carranza, Nicolas and Skwark, Marcin and others},
  journal={bioRxiv},
  pages={2023--10},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
```