GENA-LM (gena-lm-bert-large-t2t)
GENA-LM is a family of open-source foundational models for long DNA sequences. The models are transformer masked language models trained on human DNA sequences.
🚀 Quick Start
For quick access to the source code and data, visit the official repository: https://github.com/AIRI-Institute/GENA_LM. You can also refer to the related research paper: https://academic.oup.com/nar/article/53/2/gkae1310/7954523.
⨠Features
Key Differences from DNABERT
- Tokenization: GENA-LM (gena-lm-bert-large-t2t) uses BPE tokenization instead of k-mers (see the sketch after this list).
- Input Sequence Size: it can handle input sequences of about 4,500 nucleotides (512 BPE tokens), compared to DNABERT's 512 nucleotides.
- Pre-training: GENA-LM is pre-trained on the T2T human genome assembly, while DNABERT uses the GRCh38.p13 human genome assembly.
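As a rough illustration of the tokenization difference, the sketch below (a minimal example; the DNA string is an arbitrary placeholder, not taken from the paper) tokenizes a short sequence and shows that each BPE token covers several nucleotides on average, which is why 512 tokens span roughly 4,500 bp:

```python
from transformers import AutoTokenizer

# Load the GENA-LM BPE tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t')

# Arbitrary placeholder DNA fragment (illustration only)
dna = "ATGCGTACGTTAGCTAGCTAGGCTATCGATCGTACGATCGTAGCTAGCTAACGTACCGGTA"

tokens = tokenizer.tokenize(dna)
print(f"{len(dna)} nucleotides -> {len(tokens)} BPE tokens")
print(tokens)
```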
Finetuned Models
This repository contains models finetuned on various downstream tasks.
Models for GENA-Web
The models used in the GENA-Web tool for genomic sequence annotation can be found in the gena_web_promoters_2000 branch.
💻 Usage Examples
Basic Usage
How to load the pre-trained model for masked language modeling
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t')
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t', trust_remote_code=True)
```
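A minimal follow-up sketch (the sequence is an arbitrary placeholder) of a forward pass with the loaded model; it assumes the custom model class follows the standard Hugging Face BERT output interface and exposes last_hidden_state:

```python
import torch

# Arbitrary placeholder DNA fragment
seq = "AATTCTGGGAACCTGGTACCGTTAGCCTAAGGCATTACGGAAC"
inputs = tokenizer(seq, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# Token-level embeddings from the final transformer layer
print(outputs.last_hidden_state.shape)  # expected: (1, num_tokens, 1024)
```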
Advanced Usage
How to load the pre-trained model to fine-tune it on a classification task
Method 1: Get model class from the GENA-LM repository
```bash
git clone https://github.com/AIRI-Institute/GENA_LM.git
```
```python
# Import the classification head defined in the cloned GENA-LM repository
from GENA_LM.src.gena_lm.modeling_bert import BertForSequenceClassification
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t')
model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t')
```
You can also download modeling_bert.py and place it near your code.
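For orientation, here is a hedged sketch of a single forward pass with the classification head loaded above; the sequences and labels are placeholders, and it assumes the GENA-LM BertForSequenceClassification follows the standard Hugging Face interface (returning a loss and logits when labels are passed):

```python
import torch

# Placeholder DNA fragments with dummy binary labels (not a real downstream dataset)
sequences = ["ATGCGTACGTTAGCTAGCTAGGCTAACG", "TTGACCGGTAACGGTACCATGCGTAGCT"]
labels = torch.tensor([0, 1])

batch = tokenizer(sequences, padding=True, return_tensors='pt')
outputs = model(**batch, labels=labels)

print(outputs.loss)          # cross-entropy loss on the dummy labels
print(outputs.logits.shape)  # expected: (batch_size, num_labels)
```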
Method 2: Get model class from HuggingFace AutoModel
```python
from transformers import AutoTokenizer, AutoModel

model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t', trust_remote_code=True)

# Find the module that hosts the model's custom classes
gena_module_name = model.__class__.__module__
print(gena_module_name)

# Import the sequence classification head from that module
import importlib
cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
print(cls)

model = cls.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t', num_labels=2)
```
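Both methods resolve to the same BertForSequenceClassification class. Other task-specific heads defined in the same module should be loadable in the same way by substituting the class name passed to getattr.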
📚 Documentation
Model Description
The GENA-LM (gena-lm-bert-large-t2t) model is trained in a masked language model (MLM) fashion, following the approach proposed in the BigBird paper: 15% of tokens are masked. The configuration of gena-lm-bert-large-t2t is similar to bert-large-uncased:
| Property | Details |
| --- | --- |
| Maximum sequence length | 512 tokens |
| Layers | 24 |
| Attention heads | 16 |
| Hidden size | 1024 |
| Vocabulary size | 32k |
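The values in the table can be cross-checked against the published model configuration; a minimal sketch, assuming AutoConfig resolves the model's custom configuration when trust_remote_code is enabled:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t', trust_remote_code=True)
print(config.num_hidden_layers)    # layers
print(config.num_attention_heads)  # attention heads
print(config.hidden_size)          # hidden size
print(config.vocab_size)           # vocabulary size
```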
The gena-lm-bert-large-t2t model was pre-trained on the latest T2T human genome assembly (https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.3/). The data was augmented by sampling mutations from 1000-genome SNPs (gnomAD dataset). Pre-training was performed for 1,750,000 iterations with a batch size of 256 and a sequence length of 512 tokens. The Transformer architecture was modified to use pre-layer normalization.
Evaluation
For detailed evaluation results, please refer to our research paper: https://academic.oup.com/nar/article/53/2/gkae1310/7954523.
📄 License
Citation
```bibtex
@article{GENA_LM,
    author  = {Fishman, Veniamin and Kuratov, Yuri and Shmelev, Aleksei and Petrov, Maxim and Penzar, Dmitry and Shepelin, Denis and Chekanov, Nikolay and Kardymon, Olga and Burtsev, Mikhail},
    title   = {GENA-LM: a family of open-source foundational DNA language models for long sequences},
    journal = {Nucleic Acids Research},
    volume  = {53},
    number  = {2},
    pages   = {gkae1310},
    year    = {2025},
    month   = {01},
    issn    = {0305-1048},
    doi     = {10.1093/nar/gkae1310},
    url     = {https://doi.org/10.1093/nar/gkae1310},
    eprint  = {https://academic.oup.com/nar/article-pdf/53/2/gkae1310/61443229/gkae1310.pdf},
}
```