# PlantCaduceus
PlantCaduceus is a pre-trained DNA language model, leveraging 16 Angiosperm genomes to learn evolutionary conservation and DNA sequence grammar.
## Quick Start
PlantCaduceus is a DNA language model pre-trained on 16 Angiosperm genomes. It uses the [Caduceus](https://caduceus-dna.github.io/) and Mamba architectures and a masked language modeling objective. The model is designed to learn evolutionary conservation and DNA sequence grammar from 16 species spanning a 160-million-year history.
A series of PlantCaduceus models have been trained with different parameter sizes:
- [PlantCaduceus_l20](https://huggingface.co/kuleshov-group/PlantCaduceus_l20): 20 layers, 384 hidden size, 20M parameters
- [PlantCaduceus_l24](https://huggingface.co/kuleshov-group/PlantCaduceus_l24): 24 layers, 512 hidden size, 40M parameters
- [PlantCaduceus_l28](https://huggingface.co/kuleshov-group/PlantCaduceus_l28): 28 layers, 768 hidden size, 112M parameters
- [PlantCaduceus_l32](https://huggingface.co/kuleshov-group/PlantCaduceus_l32): 32 layers, 1024 hidden size, 225M parameters
We highly recommend using the largest model ([PlantCaduceus_l32](https://huggingface.co/kuleshov-group/PlantCaduceus_l32)) for zero-shot score estimation.
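The sketch below illustrates one way a zero-shot score for a candidate variant could be computed from the masked-language-modeling head: mask the position of interest and compare the model's log-likelihoods for the reference and alternate nucleotides. The window, position, alleles, token casing, and use of `tokenizer.mask_token_id` are illustrative assumptions rather than the exact procedure from the paper; check the tokenizer's vocabulary and the authors' published scoring pipeline before relying on it.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_path = 'kuleshov-group/PlantCaduceus_l32'
device = "cuda:0" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True, device_map=device)
model.eval()

# Hypothetical example: a short window with a candidate variant at a 0-based position
sequence = "ATGCGTACGATCGTAG"
pos = 8                # 0-based position of the candidate variant (assumed)
ref, alt = "G", "A"    # hypothetical reference and alternate alleles

encoding = tokenizer.encode_plus(
    sequence,
    return_tensors="pt",
    return_attention_mask=False,
    return_token_type_ids=False
)
input_ids = encoding["input_ids"].to(device)

# Mask the variant position; assumes the tokenizer defines a mask token, uses one
# token per nucleotide, and prepends no special tokens, so token index == position.
input_ids[0, pos] = tokenizer.mask_token_id

with torch.inference_mode():
    logits = model(input_ids=input_ids).logits     # (1, seq_len, vocab_size)
log_probs = torch.log_softmax(logits[0, pos], dim=-1)

# Token IDs for the two alleles; verify casing against the tokenizer's vocabulary.
ref_id = tokenizer.convert_tokens_to_ids(ref)
alt_id = tokenizer.convert_tokens_to_ids(alt)

# Log-likelihood ratio of alternate vs. reference allele at the masked position.
# The sign convention and any aggregation used in the paper may differ.
score = (log_probs[alt_id] - log_probs[ref_id]).item()
print(f"zero-shot log-likelihood ratio (alt vs. ref): {score:.4f}")
```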
## Usage Examples
### Basic Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load the pre-trained model and tokenizer
model_path = 'kuleshov-group/PlantCaduceus_l20'
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True, device_map=device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Tokenize an example DNA sequence
sequence = "ATGCGTACGATCGTAG"
encoding = tokenizer.encode_plus(
    sequence,
    return_tensors="pt",
    return_attention_mask=False,
    return_token_type_ids=False
)
input_ids = encoding["input_ids"].to(device)

# Forward pass, keeping hidden states for downstream use
with torch.inference_mode():
    outputs = model(input_ids=input_ids, output_hidden_states=True)
```
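The forward pass above already requests hidden states. As an illustrative follow-up (not a procedure prescribed by the model card), the last-layer hidden states can be used as per-nucleotide embeddings, for example mean-pooled into a single sequence-level vector:

```python
# Continues from the snippet above: `outputs` holds the forward pass with output_hidden_states=True.
last_hidden = outputs.hidden_states[-1]            # shape: (batch, seq_len, hidden_size)
token_embeddings = last_hidden[0]                  # per-nucleotide embeddings for the single input sequence
sequence_embedding = token_embeddings.mean(dim=0)  # simple mean pooling; an illustrative choice only
print(token_embeddings.shape, sequence_embedding.shape)
```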
## License
This project is licensed under the Apache-2.0 license.
## Citation
```bibtex
@article{Zhai2024.06.04.596709,
    author = {Zhai, Jingjing and Gokaslan, Aaron and Schiff, Yair and Berthel, Ana and Liu, Zong-Yan and Miller, Zachary R and Scheben, Armin and Stitzer, Michelle C and Romay, Cinta and Buckler, Edward S. and Kuleshov, Volodymyr},
    title = {Cross-species plant genomes modeling at single nucleotide resolution using a pre-trained DNA language model},
    elocation-id = {2024.06.04.596709},
    year = {2024},
    doi = {10.1101/2024.06.04.596709},
    URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709},
    eprint = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709.full.pdf},
    journal = {bioRxiv}
}
```
## Contact
Jingjing Zhai (jz963@cornell.edu)