Nucleotide Transformer V2 50m 3mer Multi Species

Developed by InstaDeepAI
DNA sequence foundation language model pre-trained on genomes from 850 species and specialized for protein-related prediction tasks
Downloads 33
Release Time: 5/8/2024

Model Overview

This model is pre-trained on multi-species genomic data (over 3,200 human genomes plus genomes from 850 diverse species) and targets accurate molecular phenotype prediction, with a focus on downstream protein tasks.

Model Features

Multi-species genome integration
Pre-training data covers 850 species (both model and non-model organisms), which avoids the limitations of training on a single reference genome
3mer tokenization optimization
Uses a 3-mer tokenization strategy, with a vocabulary size of 4,105, to support finer-grained protein prediction
Enhanced architecture design
Uses rotary positional embeddings instead of learned positional embeddings and adds gated linear units to improve model performance
Large-scale pre-training
Trained on 174 billion nucleotides (29 billion tokens) with a batch size of 1 million tokens
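
To make the 3-mer tokenization feature concrete, here is a minimal sketch of how a DNA sequence could be split into non-overlapping 3-mers. The function name `kmer_tokenize` is hypothetical, and emitting leftover bases as single-nucleotide tokens is an assumption made for illustration; the model's actual tokenizer may handle edge cases (such as ambiguous bases) differently.

```python
def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a DNA sequence into non-overlapping k-mers, left to right.

    Assumption: bases left over after the last full k-mer are emitted
    as single-nucleotide tokens.
    """
    sequence = sequence.upper()
    # Index where the last complete k-mer ends.
    cut = len(sequence) - len(sequence) % k
    tokens = [sequence[i:i + k] for i in range(0, cut, k)]
    # Append the remaining 0 to k-1 bases one by one.
    tokens.extend(sequence[cut:])
    return tokens


print(kmer_tokenize("ATGGCCTTAGA"))  # ['ATG', 'GCC', 'TTA', 'G', 'A']
```

Non-overlapping 3-mers line up with codon length, which is one plausible reason this granularity suits protein-related tasks.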

Model Capabilities

DNA sequence embedding generation
Masked nucleotide prediction
Protein function inference
Genomic feature extraction
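
As an illustration of DNA sequence embedding generation, the sketch below loads the model through the Hugging Face `transformers` library and mean-pools the last hidden layer over non-padding tokens. The Hub identifier in `MODEL_ID` is assumed from the model name on this page, and `mean_pool` and `embed` are hypothetical helper names; treat this as a starting point under those assumptions rather than the reference usage.

```python
# Assumed Hub ID, inferred from the model name; verify before use.
MODEL_ID = "InstaDeepAI/nucleotide-transformer-v2-50m-3mer-multi-species"


def mean_pool(hidden, mask):
    """Average the token vectors whose mask entry is 1 (padding is 0)."""
    kept = [vec for vec, m in zip(hidden, mask) if m]
    return [sum(col) / len(kept) for col in zip(*kept)]


def embed(sequences):
    """Return one mean-pooled embedding (list of floats) per DNA sequence."""
    # Imported here so mean_pool stays usable without these packages installed.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # trust_remote_code is assumed to be needed for the custom architecture.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForMaskedLM.from_pretrained(MODEL_ID, trust_remote_code=True)
    model.eval()

    batch = tokenizer(sequences, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(
            batch["input_ids"],
            attention_mask=batch["attention_mask"],
            output_hidden_states=True,
        )
    last_hidden = out.hidden_states[-1]  # shape: (batch, tokens, hidden)
    return [
        mean_pool(h.tolist(), m.tolist())
        for h, m in zip(last_hidden, batch["attention_mask"])
    ]
```

Calling `embed(["ATTCCGATTCCGATTCCG"])` downloads the weights on first use and returns a list containing one vector. The same forward pass also produces per-token logits, which is what masked nucleotide prediction would read from.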

Use Cases

Genomics research
Conserved sequence analysis
Identify evolutionarily conserved regions through cross-species sequence alignment
May help detect homologous sequences in distantly related species that traditional methods struggle to identify
Protein-coding region prediction
Predict potential protein-coding regions based on DNA sequences
Reported to perform strongly on the InstaDeepAI/true-cds-protein-tasks dataset
Biomedical applications
Disease-associated variant detection
Identify DNA variants that may cause protein dysfunction
Reported to improve prediction sensitivity for variants in non-coding regions
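
One common way to explore variant effects with an embedding model is to compare the embeddings of a reference sequence and its mutated copy. The sketch below shows that idea under stated assumptions: it is not the method described by the model's authors, all function names are hypothetical, and `toy_embed` is a placeholder (base counts) standing in for a real embedding function.

```python
import math


def apply_snv(sequence, position, alt):
    """Return a copy of sequence with the base at 0-based position replaced by alt."""
    if not 0 <= position < len(sequence):
        raise IndexError("position outside sequence")
    return sequence[:position] + alt + sequence[position + 1:]


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def score_variant(reference, position, alt, embed_fn):
    """Embedding distance between reference and variant (larger = bigger shift)."""
    variant = apply_snv(reference, position, alt)
    ref_vec, var_vec = embed_fn([reference, variant])
    return cosine_distance(ref_vec, var_vec)


def toy_embed(sequences):
    """Placeholder embedding: counts of A, C, G, T. Not a real model."""
    return [[float(s.count(b)) for b in "ACGT"] for s in sequences]


print(score_variant("ATGC", 1, "G", toy_embed))
```

In practice `embed_fn` would be a function that returns model embeddings, and the resulting distances would need calibration against labeled variants before being interpreted.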