DNA2Vec: Transformer-Based DNA Sequence Embedding
This repository offers an implementation of dna2vec, a transformer-based model designed for DNA sequence embeddings. It supports both Hugging Face (hf_model) and locally trained (local_model) models, and is applicable to DNA sequence alignment, classification, and other genomic tasks.
Quick Start
To use the model, install the required dependencies:
pip install transformers torch
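After installing, an optional sanity check confirms that both dependencies import correctly:
python -c "import torch, transformers; print(torch.__version__, transformers.__version__)"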
Features
- Transformer-based architecture trained on genomic data.
- Reference-free embeddings that enable efficient sequence retrieval.
- Contrastive loss for self-supervised training, ensuring robust sequence similarity learning.
- Support for Hugging Face and custom-trained local models.
- Efficient search through a DNA vector store, reducing genome-wide alignment to a local search.
Documentation
Model Overview
DNA sequence alignment is a crucial genomic task that maps short DNA reads to the most likely locations in a reference genome. Traditional methods rely on genome indexing and efficient search algorithms, while recent advancements use transformer-based models to encode DNA sequences into vector representations.
The dna2vec framework presents a Reference-Free DNA Embedding (RDE) Transformer model, which encodes DNA sequences into a shared vector space for efficient similarity search and sequence alignment.
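In practice, alignment reduces to nearest-neighbor search in this embedding space: a read embedding is compared against precomputed embeddings of reference fragments, and the best-matching fragments are returned for a local search. The sketch below illustrates only that retrieval step, with random vectors standing in for real model embeddings; fragment_embeddings, read_embedding, and the brute-force search are illustrative assumptions, not the repository's vector-store implementation.

import torch
import torch.nn.functional as F

# Placeholder store: one L2-normalized embedding per reference fragment.
# In practice these would come from the dna2vec encoder, not torch.randn.
fragment_embeddings = F.normalize(torch.randn(1000, 1020), dim=-1)

# Placeholder query embedding for a single read.
read_embedding = F.normalize(torch.randn(1, 1020), dim=-1)

# Cosine similarity against every stored fragment, then keep the top hits.
scores = read_embedding @ fragment_embeddings.T      # shape: (1, 1000)
top_scores, top_indices = scores.topk(k=5, dim=-1)   # five best-matching fragments
print(top_indices.tolist(), top_scores.tolist())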
Model Details
Model Architecture
The transformer model consists of:
- 12 attention heads
- 6 encoder layers
- Embedding dimension: 1020
- Vocabulary size: 10,000
- Cosine similarity-based sequence matching
- Dropout: 0.1
- Training: Cosine Annealing learning rate scheduling
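For intuition, the hyperparameters listed above correspond roughly to an encoder configured as in the sketch below. This is an illustrative stand-in built from stock PyTorch modules, not the repository's actual model class; the feed-forward width and layer wiring are assumptions.

import torch.nn as nn

# Rough stand-in for the listed architecture: 6 encoder layers, 12 attention heads,
# 1020-dimensional embeddings, a 10,000-token vocabulary, and 0.1 dropout.
token_embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=1020)
encoder_layer = nn.TransformerEncoderLayer(d_model=1020, nhead=12, dropout=0.1, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)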
Usage Examples
Basic Usage
Load Hugging Face Model
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn as nn
def load_hf_model():
    # Load the pretrained dna2vec encoder and tokenizer from the Hugging Face Hub.
    hf_model = AutoModel.from_pretrained("roychowdhuryresearch/dna2vec", trust_remote_code=True)
    hf_tokenizer = AutoTokenizer.from_pretrained("roychowdhuryresearch/dna2vec", trust_remote_code=True)

    class AveragePooler(nn.Module):
        # Mean-pool token embeddings, ignoring padded positions via the attention mask.
        def forward(self, last_hidden, attention_mask):
            return (last_hidden * attention_mask.unsqueeze(-1)).sum(1) / attention_mask.sum(-1).unsqueeze(-1)

    hf_model.pooler = AveragePooler()
    return hf_model, hf_tokenizer, hf_model.pooler
Advanced Usage
Using the Model
Once the model is loaded, you can use it to obtain embeddings for DNA sequences:
def get_embedding(dna_sequence):
    # Note: this reloads the model on every call; cache the returned objects
    # if you need to embed many sequences.
    model, tokenizer, pooler = load_hf_model()
    tokenized_input = tokenizer(dna_sequence, return_tensors="pt", padding=True)
    with torch.no_grad():
        output = model(**tokenized_input)
        # Average-pool the per-token hidden states into one vector per sequence.
        embedding = pooler(output.last_hidden_state, tokenized_input.attention_mask)
    return embedding.numpy()

dna_seq = "ATGCGTACGTAGCTAGCTAGC"
embedding = get_embedding(dna_seq)
print("Embedding shape:", embedding.shape)
Technical Details
Dataset
The training data consists of DNA sequences sampled from various chromosomes across species. The dataset covers approximately 2% of the human genome, which encourages generalization across different sequences. Reads are generated using ART MiSeq simulation, with variations in insertion and deletion rates.
Training Procedure
- Self-Supervised Learning: Contrastive loss-based training.
- Dynamic Length Sequences: DNA fragments of length 800-2000 with reads sampled in [150, 500].
- Noise Augmentation: 1-5% random base substitutions in 40% of training reads (see the sketch after this list).
- Batch Size: 16 with gradient accumulation.
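As a rough illustration of the noise-augmentation step described above (not the repository's training code; the mutate_read helper and sampling choices are assumptions based on the listed rates):

import random

def mutate_read(read: str, sub_rate: float) -> str:
    # Substitute each base independently with probability sub_rate.
    bases = "ACGT"
    return "".join(
        random.choice([b for b in bases if b != base]) if random.random() < sub_rate else base
        for base in read
    )

read = "ATGCGTACGTAGCTAGCTAGC"
if random.random() < 0.4:                                  # augment ~40% of training reads
    read = mutate_read(read, random.uniform(0.01, 0.05))   # 1-5% substitution rate
print(read)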
Evaluation
The model was evaluated against traditional aligners (Bowtie-2) and other Transformer-based baselines (DNABERT-2, HyenaDNA). The evaluation metrics include:
- Alignment Recall: >99% for high-quality reads.
- Cross-Species Transfer: Successfully aligns sequences from different species, including Thermus aquaticus and Rattus norvegicus.
License
This project is licensed under the MIT license.
Citation
If you use this model, please cite:
@article{10.1093/bioinformatics/btaf041,
author = {Holur, Pavan and Enevoldsen, K C and Rajesh, Shreyas and Mboning, Lajoyce and Georgiou, Thalia and Bouchard, Louis-S and Pellegrini, Matteo and Roychowdhury, Vwani},
title = {Embed-Search-Align: DNA Sequence Alignment using Transformer models},
journal = {Bioinformatics},
pages = {btaf041},
year = {2025},
month = {02},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btaf041},
url = {https://doi.org/10.1093/bioinformatics/btaf041},
eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btaf041/61778456/btaf041.pdf},
}
For more details, check the full paper.