GENA-LM BERT Large T2T

Developed by AIRI-Institute
GENA-LM is an open-source foundational model family for long DNA sequences, based on a Transformer masked language model trained on human DNA sequences.
Downloads: 386
Release Time: 4/2/2023

Model Overview

The GENA-LM model is a Transformer masked language model trained on human DNA sequences, specifically designed for processing long DNA sequences.

Model Features

Long sequence processing capability
Accepts inputs of 512 BPE tokens (approximately 4500 nucleotides), compared with DNABERT's 512 nucleotides
BPE tokenization
Uses BPE tokenization instead of k-mer tokenization, so a fixed budget of 512 tokens covers a much longer stretch of DNA
T2T genome pre-training
Pre-trained on the T2T (telomere-to-telomere) human genome assembly rather than GRCh38.p13
Pre-training data augmentation
Augments the pre-training data by sampling mutations from 1000 Genomes Project SNPs (gnomAD dataset)
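
The SNP-based augmentation described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `augment_with_snps`, the `rate` parameter, and the toy variant table are all assumptions made for illustration, and the page does not say how mutations are actually sampled. The sketch only shows the general idea of applying a random subset of known single-nucleotide variants to a reference sequence.

```python
import random


def augment_with_snps(sequence, snps, rate=0.5, seed=None):
    """Return a copy of `sequence` with a random subset of SNPs applied.

    `snps` maps a 0-based position to a (ref_base, alt_base) pair.
    `rate` is the probability of applying each variant (an assumed
    parameter; the real sampling scheme is not described on this page).
    """
    rng = random.Random(seed)
    bases = list(sequence)
    for pos, (ref, alt) in snps.items():
        # Ignore variants that fall outside the sequence or that do not
        # match the reference base at their position.
        if pos < 0 or pos >= len(bases) or bases[pos] != ref:
            continue
        if rng.random() < rate:
            bases[pos] = alt
    return "".join(bases)


reference = "ACGTACGTAC"
# Hypothetical variants, not taken from gnomAD.
snps = {1: ("C", "T"), 4: ("A", "G"), 8: ("A", "C")}

# With rate=1.0 every matching variant is applied.
print(augment_with_snps(reference, snps, rate=1.0))  # prints ATGTGCGTCC
```

With a `rate` below 1.0, each call yields a different mutated copy of the same region, which is how one reference sequence could be turned into several training examples.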

Model Capabilities

DNA sequence analysis
Promoter prediction
Splice site prediction
Genome sequence annotation
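
Because the input is limited to 512 BPE tokens (approximately 4500 nucleotides), applying the model to a longer sequence would presumably require splitting it into windows first. The following Python sketch shows one way to do that. The function name `split_into_windows` and the `window` and `overlap` defaults are assumptions, not values from the GENA-LM authors. Since the 4500-nucleotide figure is only approximate, the token count of each window should be checked with the model's tokenizer before use.

```python
def split_into_windows(sequence, window=4500, overlap=500):
    """Split `sequence` into overlapping windows.

    Returns a list of (start, chunk) pairs, where `start` is the 0-based
    offset of the chunk in the original sequence. The defaults are
    illustrative assumptions, not recommended settings.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if not sequence:
        return []
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, len(sequence))
        windows.append((start, sequence[start:end]))
        if end == len(sequence):
            break
        start += step
    return windows


# Toy 10,000-nucleotide sequence.
sequence = "ACGT" * 2500
windows = split_into_windows(sequence)
print([(start, len(chunk)) for start, chunk in windows])
# prints [(0, 4500), (4000, 4500), (8000, 2000)]
```

The overlap means a feature that straddles a window boundary still appears whole in at least one window, at the cost of some repeated computation.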

Use Cases

Genomics research
300 bp promoter prediction
Predicts promoter regions in 300 bp DNA sequences
Specific performance metrics available in the paper
2000 bp promoter prediction
Predicts promoter regions in 2000 bp DNA sequences
Specific performance metrics available in the paper
Splice site prediction
Predicts splice sites in DNA sequences
Specific performance metrics available in the paper
Genome sequence annotation tools
GENA-Web application
Used in the GENA-Web genome sequence annotation tool