DNA2Vec: Transformer-Based DNA Sequence Embedding
This repository offers an implementation of dna2vec, a transformer-based model designed for DNA sequence embeddings. It supports both Hugging Face (hf_model) and locally trained (local_model) models, and is applicable to DNA sequence alignment, classification, and other genomic tasks.
Quick Start
To use the model, install the required dependencies:
pip install transformers torch
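After installing, an optional sanity check confirms that both dependencies import correctly:
python -c "import torch, transformers; print(torch.__version__, transformers.__version__)"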
Features
- Transformer-based architecture trained on genomic data.
- Reference-free embeddings that enable efficient sequence retrieval.
- Contrastive loss for self-supervised training, ensuring robust sequence similarity learning.
- Support for Hugging Face and custom-trained local models.
- Efficient search through a DNA vector store, reducing genome-wide alignment to a local search.
Documentation
Model Overview
DNA sequence alignment is a crucial genomic task that maps short DNA reads to the most likely locations in a reference genome. Traditional methods rely on genome indexing and efficient search algorithms, while recent advancements use transformer-based models to encode DNA sequences into vector representations.
The dna2vec framework presents a Reference-Free DNA Embedding (RDE) Transformer model, which encodes DNA sequences into a shared vector space for efficient similarity search and sequence alignment.
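In practice, alignment reduces to nearest-neighbor search in this embedding space: a read embedding is compared against precomputed embeddings of reference fragments, and the best-matching fragments are returned for a local search. The sketch below illustrates only that retrieval step, with random vectors standing in for real model embeddings; fragment_embeddings, read_embedding, and the brute-force search are illustrative assumptions, not the repository's vector-store implementation.

import torch
import torch.nn.functional as F

# Placeholder store: one L2-normalized embedding per reference fragment.
# In practice these would come from the dna2vec encoder, not torch.randn.
fragment_embeddings = F.normalize(torch.randn(1000, 1020), dim=-1)

# Placeholder query embedding for a single read.
read_embedding = F.normalize(torch.randn(1, 1020), dim=-1)

# Cosine similarity against every stored fragment, then keep the top hits.
scores = read_embedding @ fragment_embeddings.T      # shape: (1, 1000)
top_scores, top_indices = scores.topk(k=5, dim=-1)   # five best-matching fragments
print(top_indices.tolist(), top_scores.tolist())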
Model Details
Model Architecture
The transformer model consists of:
- 12 attention heads
- 6 encoder layers
- Embedding dimension: 1020
- Vocabulary size: 10,000
- Cosine similarity-based sequence matching
- Dropout: 0.1
- Training: Cosine Annealing learning rate scheduling
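For intuition, the hyperparameters listed above correspond roughly to an encoder configured as in the sketch below. This is an illustrative stand-in built from stock PyTorch modules, not the repository's actual model class; the feed-forward width and layer wiring are assumptions.

import torch.nn as nn

# Rough stand-in for the listed architecture: 6 encoder layers, 12 attention heads,
# 1020-dimensional embeddings, a 10,000-token vocabulary, and 0.1 dropout.
token_embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=1020)
encoder_layer = nn.TransformerEncoderLayer(d_model=1020, nhead=12, dropout=0.1, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)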
Usage Examples
Basic Usage
Load Hugging Face Model
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn as nn
def load_hf_model():
    # Load the pretrained dna2vec encoder and tokenizer from the Hugging Face Hub.
    hf_model = AutoModel.from_pretrained("roychowdhuryresearch/dna2vec", trust_remote_code=True)
    hf_tokenizer = AutoTokenizer.from_pretrained("roychowdhuryresearch/dna2vec", trust_remote_code=True)

    class AveragePooler(nn.Module):
        # Mean-pool token embeddings, ignoring padded positions via the attention mask.
        def forward(self, last_hidden, attention_mask):
            return (last_hidden * attention_mask.unsqueeze(-1)).sum(1) / attention_mask.sum(-1).unsqueeze(-1)

    hf_model.pooler = AveragePooler()
    return hf_model, hf_tokenizer, hf_model.pooler
Advanced Usage
Using the Model
Once the model is loaded, you can use it to obtain embeddings for DNA sequences:
def get_embedding(dna_sequence):
    # Note: this reloads the model on every call; cache the returned objects
    # if you need to embed many sequences.
    model, tokenizer, pooler = load_hf_model()
    tokenized_input = tokenizer(dna_sequence, return_tensors="pt", padding=True)
    with torch.no_grad():
        output = model(**tokenized_input)
        # Average-pool the per-token hidden states into one vector per sequence.
        embedding = pooler(output.last_hidden_state, tokenized_input.attention_mask)
    return embedding.numpy()

dna_seq = "ATGCGTACGTAGCTAGCTAGC"
embedding = get_embedding(dna_seq)
print("Embedding shape:", embedding.shape)
Technical Details
Dataset
The training data consists of DNA sequences sampled from various chromosomes across species. The dataset covers approximately 2% of the human genome, which encourages generalization across different sequences. Reads are generated using ART MiSeq simulation, with variations in insertion and deletion rates.
Training Procedure
- Self-Supervised Learning: Contrastive loss-based training.
- Dynamic Length Sequences: DNA fragments of length 800-2000 with reads sampled in [150, 500].
- Noise Augmentation: 1-5% random base substitutions in 40% of training reads (see the sketch after this list).
- Batch Size: 16 with gradient accumulation.
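As a rough illustration of the noise-augmentation step described above (not the repository's training code; the mutate_read helper and sampling choices are assumptions based on the listed rates):

import random

def mutate_read(read: str, sub_rate: float) -> str:
    # Substitute each base independently with probability sub_rate.
    bases = "ACGT"
    return "".join(
        random.choice([b for b in bases if b != base]) if random.random() < sub_rate else base
        for base in read
    )

read = "ATGCGTACGTAGCTAGCTAGC"
if random.random() < 0.4:                                  # augment ~40% of training reads
    read = mutate_read(read, random.uniform(0.01, 0.05))   # 1-5% substitution rate
print(read)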
Evaluation
The model was evaluated against traditional aligners (Bowtie-2) and other Transformer-based baselines (DNABERT-2, HyenaDNA). The evaluation metrics include:
- Alignment Recall: >99% for high-quality reads.
- Cross-Species Transfer: Successfully aligns sequences from different species, including Thermus aquaticus and Rattus norvegicus.
License
This project is licensed under the MIT license.
Citation
If you use this model, please cite:
@article{10.1093/bioinformatics/btaf041,
author = {Holur, Pavan and Enevoldsen, K C and Rajesh, Shreyas and Mboning, Lajoyce and Georgiou, Thalia and Bouchard, Louis-S and Pellegrini, Matteo and Roychowdhury, Vwani},
title = {Embed-Search-Align: DNA Sequence Alignment using Transformer models},
journal = {Bioinformatics},
pages = {btaf041},
year = {2025},
month = {02},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btaf041},
url = {https://doi.org/10.1093/bioinformatics/btaf041},
eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btaf041/61778456/btaf041.pdf},
}
For more details, check the full paper.