# PlantCaduceus
PlantCaduceus is a pre-trained DNA language model, leveraging 16 Angiosperm genomes to learn evolutionary conservation and DNA sequence grammar.
## Quick Start
PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. It uses the [Caduceus](https://caduceus-dna.github.io/) and Mamba architectures and a masked language modeling objective. The model is designed to learn evolutionary conservation and DNA sequence grammar from 16 species spanning a 160-million-year history.
A series of PlantCaduceus models have been trained with different parameter sizes:
- [PlantCaduceus_l20](https://huggingface.co/kuleshov-group/PlantCaduceus_l20): 20 layers, 384 hidden size, 20M parameters
- [PlantCaduceus_l24](https://huggingface.co/kuleshov-group/PlantCaduceus_l24): 24 layers, 512 hidden size, 40M parameters
- [PlantCaduceus_l28](https://huggingface.co/kuleshov-group/PlantCaduceus_l28): 28 layers, 768 hidden size, 112M parameters
- [PlantCaduceus_l32](https://huggingface.co/kuleshov-group/PlantCaduceus_l32): 32 layers, 1024 hidden size, 225M parameters
We highly recommend using the largest model ([PlantCaduceus_l32](https://huggingface.co/kuleshov-group/PlantCaduceus_l32)) for zero-shot score estimation.
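The sketch below illustrates one way a zero-shot score for a candidate variant could be computed from the masked-language-modeling head: mask the position of interest and compare the model's log-likelihoods for the reference and alternate nucleotides. The window, position, alleles, token casing, and use of `tokenizer.mask_token_id` are illustrative assumptions rather than the exact procedure from the paper; check the tokenizer's vocabulary and the authors' published scoring pipeline before relying on it.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_path = 'kuleshov-group/PlantCaduceus_l32'
device = "cuda:0" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True, device_map=device)
model.eval()

# Hypothetical example: a short window with a candidate variant at a 0-based position
sequence = "ATGCGTACGATCGTAG"
pos = 8                # 0-based position of the candidate variant (assumed)
ref, alt = "G", "A"    # hypothetical reference and alternate alleles

encoding = tokenizer.encode_plus(
    sequence,
    return_tensors="pt",
    return_attention_mask=False,
    return_token_type_ids=False
)
input_ids = encoding["input_ids"].to(device)

# Mask the variant position; assumes the tokenizer defines a mask token, uses one
# token per nucleotide, and prepends no special tokens, so token index == position.
input_ids[0, pos] = tokenizer.mask_token_id

with torch.inference_mode():
    logits = model(input_ids=input_ids).logits     # (1, seq_len, vocab_size)
log_probs = torch.log_softmax(logits[0, pos], dim=-1)

# Token IDs for the two alleles; verify casing against the tokenizer's vocabulary.
ref_id = tokenizer.convert_tokens_to_ids(ref)
alt_id = tokenizer.convert_tokens_to_ids(alt)

# Log-likelihood ratio of alternate vs. reference allele at the masked position.
# The sign convention and any aggregation used in the paper may differ.
score = (log_probs[alt_id] - log_probs[ref_id]).item()
print(f"zero-shot log-likelihood ratio (alt vs. ref): {score:.4f}")
```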
## Usage Examples
### Basic Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load the pre-trained model and tokenizer
model_path = 'kuleshov-group/PlantCaduceus_l20'
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True, device_map=device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Tokenize an example DNA sequence
sequence = "ATGCGTACGATCGTAG"
encoding = tokenizer.encode_plus(
    sequence,
    return_tensors="pt",
    return_attention_mask=False,
    return_token_type_ids=False
)
input_ids = encoding["input_ids"].to(device)

# Forward pass, keeping hidden states for downstream use
with torch.inference_mode():
    outputs = model(input_ids=input_ids, output_hidden_states=True)
```
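The forward pass above already requests hidden states. As an illustrative follow-up (not a procedure prescribed by the model card), the last-layer hidden states can be used as per-nucleotide embeddings, for example mean-pooled into a single sequence-level vector:

```python
# Continues from the snippet above: `outputs` holds the forward pass with output_hidden_states=True.
last_hidden = outputs.hidden_states[-1]            # shape: (batch, seq_len, hidden_size)
token_embeddings = last_hidden[0]                  # per-nucleotide embeddings for the single input sequence
sequence_embedding = token_embeddings.mean(dim=0)  # simple mean pooling; an illustrative choice only
print(token_embeddings.shape, sequence_embedding.shape)
```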
## License
This project is licensed under the Apache-2.0 license.
## Citation
```bibtex
@article{Zhai2024.06.04.596709,
    author = {Zhai, Jingjing and Gokaslan, Aaron and Schiff, Yair and Berthel, Ana and Liu, Zong-Yan and Miller, Zachary R and Scheben, Armin and Stitzer, Michelle C and Romay, Cinta and Buckler, Edward S. and Kuleshov, Volodymyr},
    title = {Cross-species plant genomes modeling at single nucleotide resolution using a pre-trained DNA language model},
    elocation-id = {2024.06.04.596709},
    year = {2024},
    doi = {10.1101/2024.06.04.596709},
    URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709},
    eprint = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709.full.pdf},
    journal = {bioRxiv}
}
```
## Contact
Jingjing Zhai (jz963@cornell.edu)