Transformers
The transformers library provides pre-trained models and tools for natural language processing and related tasks, enabling users to easily utilize state-of-the-art models.
Quick Start
To use the pre-trained model for masked language modeling, use the following snippet:
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_name = "kuleshov-group/caduceus-ps_seqlen-131k_d_model-256_n_layer-16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
Alternatively, you can instantiate a model from scratch to train on your own data as follows:
from transformers import AutoConfig, AutoModelForMaskedLM
config_overrides = {}
config = AutoConfig.from_pretrained(
"kuleshov-group/caduceus-ps_seqlen-131k_d_model-256_n_layer-16",
**config_overrides,
)
model = AutoModelForMaskedLM.from_config(config)
Usage Examples
Basic Usage
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_name = "kuleshov-group/caduceus-ps_seqlen-131k_d_model-256_n_layer-16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
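A minimal inference sketch, continuing from the snippet above: it runs a forward pass on a toy DNA sequence and inspects the per-position logits. The exact output fields come from the repository's custom modeling code, and depending on your transformers version you may also need to pass trust_remote_code=True to the from_pretrained calls, so treat this as illustrative.

import torch

# Toy DNA input; the tokenizer operates at the nucleotide (character) level.
sequence = "ACGTACGTACGTACGT"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Per-position logits over the nucleotide vocabulary, usable for masked-token prediction.
print(outputs.logits.shape)  # e.g. (1, sequence_length, vocab_size)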
Advanced Usage
from transformers import AutoConfig, AutoModelForMaskedLM
config_overrides = {}
config = AutoConfig.from_pretrained(
"kuleshov-group/caduceus-ps_seqlen-131k_d_model-256_n_layer-16",
**config_overrides,
)
model = AutoModelForMaskedLM.from_config(config)
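As an illustration, config_overrides can shrink the architecture for quick experiments before training from scratch. The keys below (d_model, n_layer) mirror the checkpoint name but are assumptions about the config schema; check the repository's config.json for the authoritative field names.

from transformers import AutoConfig, AutoModelForMaskedLM

# Hypothetical overrides for a smaller model; verify the key names against the repo's config.json.
config_overrides = {"d_model": 128, "n_layer": 4}
config = AutoConfig.from_pretrained(
    "kuleshov-group/caduceus-ps_seqlen-131k_d_model-256_n_layer-16",
    **config_overrides,
)
model = AutoModelForMaskedLM.from_config(config)

# Sanity check: the randomly initialized model should be much smaller than the released checkpoint.
print(sum(p.numel() for p in model.parameters()))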
Documentation
Model Details
This is the Caduceus-PS model with hidden dimension 256 and 16 MambaDNA layers.
This model is reverse complement (RC) equivariant, so no RC data augmentation is required when training it, either during pre-training or for downstream fine-tuning.
Note that the model hidden state will be twice that of a non-RC equivariant counterpart.
For downstream task training and inference, to ensure RC-invariant outputs, one can either run the downstream model on both the hidden state and its RC, or average the hidden state with its RC before passing it to the downstream model.
To RC the hidden states, one can use hidden_states.flip(dims=(-2, -1)), which flips along the sequence-length and channel dimensions.
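A minimal sketch of the averaging option described above, using a placeholder tensor in place of real model outputs (shapes are illustrative; for this checkpoint the hidden size is twice d_model, i.e. 512).

import torch

# Placeholder standing in for the model's final hidden states: (batch, seq_len, 2 * d_model).
hidden_states = torch.randn(1, 1024, 512)

# Reverse complement the hidden states by flipping the sequence-length and channel dimensions.
rc_hidden_states = hidden_states.flip(dims=(-2, -1))

# RC-invariant representation to pass to a downstream model.
pooled = (hidden_states + rc_hidden_states) / 2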
This model was pre-trained on the human reference genome with sequence length 131,072 for 50k steps (each step contained ~1M base pairs / tokens).
For more details, please see our paper: Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling.
Citation
Please cite our work using the BibTeX entry below:
@article{schiff2024caduceus,
title={Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling},
author={Schiff, Yair and Kao, Chia-Hsiang and Gokaslan, Aaron and Dao, Tri and Gu, Albert and Kuleshov, Volodymyr},
journal={arXiv preprint arXiv:2403.03234},
year={2024}
}
Model Card Contact
Yair Schiff (yzs2@cornell.edu)
License
This project is licensed under the Apache-2.0 license.