# AgroNT: A DNA Language Model for Plant Genomics
AgroNT is a DNA language model trained primarily on edible plant genomes. It uses a transformer architecture with self-attention and masked language modeling to learn general nucleotide sequence representations from 48 plant species.
## License
This project is licensed under the CC BY-NC-SA 4.0 license.
## Datasets
- InstaDeepAI/plant-genomic-benchmark
## Tags
- biology
- genomics
- language model
- plants
## Quick Start
### Features
- Transformer Architecture: AgroNT uses the transformer architecture with self-attention, enabling it to capture complex relationships in nucleotide sequences.
- Masked Language Modeling: It employs masked language modeling to learn from highly available genotype data.
- Large Parameter Count: With 1 billion parameters and a context window of 1024 tokens, AgroNT can handle relatively long nucleotide sequences.
- 6-mer Tokenizer: A non-overlapping 6-mer tokenizer is used, where 1024 tokens correspond to approximately 6144 base pairs.
### Usage Examples
#### Basic Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load the pre-trained model and its 6-mer tokenizer from the Hugging Face Hub
model_name = 'agro-nucleotide-transformer-1b'
agro_nt_model = AutoModelForMaskedLM.from_pretrained(f'InstaDeepAI/{model_name}')
agro_nt_tokenizer = AutoTokenizer.from_pretrained(f'InstaDeepAI/{model_name}')
print(f"Loaded the {model_name} model with {agro_nt_model.num_parameters()} parameters and corresponding tokenizer.")

# Tokenize a batch of nucleotide sequences, padding to the longest sequence
sequences = ['ATATACGGCCGNC', 'GGGTATCGCTTCCGAC']
batch_tokens = agro_nt_tokenizer(sequences, padding="longest")['input_ids']
print(f"Tokenized sequences: {agro_nt_tokenizer.batch_decode(batch_tokens)}")

torch_batch_tokens = torch.tensor(batch_tokens)
attention_mask = torch_batch_tokens != agro_nt_tokenizer.pad_token_id

# Forward pass; hidden states for all layers are returned
outs = agro_nt_model(
    torch_batch_tokens,
    attention_mask=attention_mask,
    encoder_attention_mask=attention_mask,
    output_hidden_states=True
)

# Per-token embeddings from the final layer and masked-language-modeling logits
embeddings = outs['hidden_states'][-1].detach().numpy()
logits = outs['logits'].detach().numpy()
```
## Documentation
### Pre-training
#### Data
The pre-training dataset was built from the reference genomes of (mostly) edible plants in the Ensembl Plants database. It consists of approximately 10.5 million genomic sequences from 48 different species.
#### Processing
All reference genomes for each species were assembled into a single FASTA file, in which nucleotides other than A, T, C, G were replaced by N. A tokenizer was then used to convert the nucleotide sequences into tokens. The tokenizer's alphabet consists of the 4096 possible 6-mer combinations of A, T, C, G, five standalone tokens for A, T, C, G, and N, and three special tokens ([PAD], [MASK], and [CLS]), resulting in a vocabulary of 4104 tokens.
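As a quick sanity check of the vocabulary size, the following is a minimal sketch that enumerates a vocabulary of this shape; the token strings and their ordering are illustrative assumptions, not the released tokenizer's actual vocabulary file:

```python
from itertools import product

# 4096 possible 6-mers over the {A, T, C, G} alphabet
six_mers = [''.join(p) for p in product("ATCG", repeat=6)]

# Five standalone nucleotide tokens plus three special tokens
standalone = ["A", "T", "C", "G", "N"]
special = ["[PAD]", "[MASK]", "[CLS]"]

vocab = special + standalone + six_mers   # ordering is illustrative only
assert len(vocab) == 4104                 # 4096 + 5 + 3
```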
#### Tokenization example
```
nucleotide sequence: ATCCCGGNNTCGACACN
tokens: <CLS> <ATCCCG> <G> <N> <N> <TCGACA> <C> <N>
```
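The exact logic lives in the released tokenizer; the sketch below (the helper `tokenize_6mer` is hypothetical) reproduces the example above with a greedy rule: emit a 6-mer token whenever the next six characters are all A, T, C, G, and otherwise fall back to a single-nucleotide token.

```python
def tokenize_6mer(sequence: str) -> list[str]:
    """Greedy 6-mer tokenization sketch: use a 6-mer token when the next six
    characters are all A/T/C/G, otherwise emit a single-nucleotide token."""
    tokens, i = ["<CLS>"], 0
    while i < len(sequence):
        chunk = sequence[i:i + 6]
        if len(chunk) == 6 and set(chunk) <= set("ATCG"):
            tokens.append(f"<{chunk}>")
            i += 6
        else:
            tokens.append(f"<{sequence[i]}>")
            i += 1
    return tokens

print(tokenize_6mer("ATCCCGGNNTCGACACN"))
# ['<CLS>', '<ATCCCG>', '<G>', '<N>', '<N>', '<TCGACA>', '<C>', '<N>']
```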
#### Training
The MLM objective was used for self-supervised pre-training: 15% of the tokens in each input sequence are selected for masking, of which 80% are replaced with the mask token, 10% are replaced by a random token from the vocabulary, and 10% are left unchanged. The model was pre-trained with a sequence length of 1024 tokens and an effective batch size of 1.5M tokens for 315k update steps, i.e. approximately 472.5B tokens seen during training.
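To make the 80/10/10 scheme concrete, here is a minimal PyTorch sketch of such a corruption step; the function name `mask_mlm_tokens`, its interface, and the token ids used in the usage line are assumptions for illustration, not the training code used for AgroNT:

```python
import torch

def mask_mlm_tokens(input_ids, vocab_size, mask_token_id, mlm_prob=0.15):
    """Sketch of MLM corruption: select 15% of positions, then replace 80% of
    them with [MASK], 10% with a random token, and leave 10% unchanged."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100  # loss is computed only on selected positions
    # (a real implementation would also exclude special tokens such as [CLS]/[PAD])

    corrupted = input_ids.clone()

    # 80% of selected positions -> [MASK]
    masked = selected & (torch.rand(input_ids.shape) < 0.8)
    corrupted[masked] = mask_token_id

    # 10% of selected positions -> random token (half of the remaining 20%)
    random_tok = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
    corrupted[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]

    # the remaining 10% of selected positions stay unchanged
    return corrupted, labels

ids = torch.randint(8, 4104, (2, 1024))  # toy batch of token ids, illustrative only
corrupted, labels = mask_mlm_tokens(ids, vocab_size=4104, mask_token_id=2)
```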
#### Hardware
Model pre-training was carried out on Google TPU v4 accelerators, specifically a TPU v4-1024 pod slice with 512 devices, and took approximately four days.
## BibTeX entry and citation info
```bibtex
@article{mendoza2023foundational,
  title={A Foundational Large Language Model for Edible Plant Genomes},
  author={Mendoza-Revilla, Javier and Trop, Evan and Gonzalez, Liam and Roller, Masa and Dalla-Torre, Hugo and de Almeida, Bernardo P and Richard, Guillaume and Caton, Jonathan and Lopez Carranza, Nicolas and Skwark, Marcin and others},
  journal={bioRxiv},
  pages={2023--10},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
```