PlantCaduceus_l20: an open-source DNA language model for studying evolutionary conservation and sequence grammar in plants

Plantcaduceus L20

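Developed by kuleshov-group

PlantCaduceus is a DNA language model pre-trained on 16 angiosperm genomes. It uses the Caduceus and Mamba architectures and a masked language modeling objective to learn evolutionary conservation and DNA sequence grammar.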

Molecular model

Transformers

License: Apache-2.0 #plant-genome-modeling #DNA-language-model #cross-species-evolutionary-analysis

Downloads: 8,967

Release date: 5/19/2024

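Model overview

PlantCaduceus is a DNA language model built for processing and analyzing plant genome sequences. It learns evolutionary conservation and DNA sequence grammar.

Model features

Multi-species genome pre-training

Pre-trained on 16 angiosperm genomes spanning 160 million years of evolutionary history.

Multiple parameter scales

Available in sizes from 20 million to 225 million parameters to suit different compute budgets.

Evolutionary conservation learning

Learns evolutionary conservation and grammatical rules in DNA sequences.

Model capabilities

DNA sequence analysis

Genomic masked language modeling

Evolutionary conservation prediction

Use cases

Genome research

DNA sequence scoring

Use the model for zero-shot score estimation on DNA sequences.

Evolutionary conservation analysis

Analyze conserved regions in DNA sequences across species.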

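🚀 PlantCaduceus - a plant DNA language model

PlantCaduceus is a DNA language model pre-trained on the genomes of 16 angiosperm species that together span 160 million years of evolutionary history. Using the Caduceus and Mamba architectures and a masked language modeling objective, it is designed to learn evolutionary conservation and DNA sequence grammar across these species.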
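To make the masked language modeling objective concrete, here is a minimal sketch in plain Python. It hides a few nucleotides in a sequence and records the original bases as the targets a model would be asked to predict. The `[MASK]` string, the 15% mask rate, and the `mask_sequence` helper are illustrative assumptions, not details taken from the PlantCaduceus training setup.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder symbol, for illustration only


def mask_sequence(sequence, mask_rate=0.15, seed=0):
    """Replace a random subset of nucleotides with MASK_TOKEN.

    Returns the masked token list and a dict mapping each masked
    position to the original base (the prediction target).
    The 15% default is a conventional choice, not a value from the source.
    """
    rng = random.Random(seed)
    tokens = list(sequence)  # one token per nucleotide
    # Mask at least one position when the sequence is non-empty.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {pos: tokens[pos] for pos in positions}
    for pos in positions:
        tokens[pos] = MASK_TOKEN
    return tokens, targets


masked_tokens, targets = mask_sequence("ATGCGTACGATCGTAG")
print(masked_tokens)
print(targets)
```

During training, the model sees only the masked tokens and is scored on how well it recovers the bases stored in `targets`; bases that are strongly constrained by their context are the ones it can learn to recover reliably.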

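🚀 Quick start

This project provides PlantCaduceus models at several parameter scales, so you can choose the one that fits your needs. For zero-shot score estimation, we strongly recommend the largest model, PlantCaduceus_l32.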
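The text above does not spell out how a zero-shot score is computed. One common approach with masked language models is a log-likelihood ratio: mask the position of interest, read the model's scores for each nucleotide, and compare the alternate base with the reference base. The sketch below shows only that arithmetic in plain Python. The `zero_shot_score` helper and the dict-of-logits input format are assumptions for illustration; in practice the logits would come from the model's output at the masked position.

```python
import math

NUCLEOTIDES = ("A", "C", "G", "T")


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_score(nucleotide_logits, ref, alt):
    """Log-likelihood ratio log P(alt) - log P(ref) at one masked position.

    nucleotide_logits: dict mapping each of A, C, G, T to a logit for
    that base at the masked position (hypothetical input format).
    A negative score means the alternate base is less likely than the
    reference base under the model.
    """
    probs = dict(zip(NUCLEOTIDES, softmax([nucleotide_logits[b] for b in NUCLEOTIDES])))
    return math.log(probs[alt]) - math.log(probs[ref])


# Made-up logits in which the model favors A at the masked position.
example_logits = {"A": 2.0, "C": 0.0, "G": 0.0, "T": 0.0}
score = zero_shot_score(example_logits, ref="A", alt="C")
print(score)
```

Because the score is a difference of log-probabilities, it needs no labeled training data, which is what makes the estimate zero-shot.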

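✨ Key features

Cross-species learning: pre-trained on 16 angiosperm genomes, the model learns evolutionary conservation and DNA sequence grammar across species spanning 160 million years of evolutionary history.
Multiple model sizes: models are available at different parameter scales (PlantCaduceus_l20, PlantCaduceus_l24, PlantCaduceus_l28, and PlantCaduceus_l32) to match different compute resources and task requirements.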

💻 Usage example

Basic usage

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load the pre-trained model and tokenizer. trust_remote_code=True lets
# transformers run the custom model code hosted in the model repository.
model_path = 'kuleshov-group/PlantCaduceus_l20'
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True, device_map=device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Tokenize a DNA sequence into input IDs.
sequence = "ATGCGTACGATCGTAG"
encoding = tokenizer.encode_plus(
    sequence,
    return_tensors="pt",
    return_attention_mask=False,
    return_token_type_ids=False,
)
input_ids = encoding["input_ids"].to(device)

# Run a forward pass without gradient tracking and request the hidden states.
with torch.inference_mode():
    outputs = model(input_ids=input_ids, output_hidden_states=True)

📚 Detailed documentation

Model parameters

Attribute	Details
Model type	PlantCaduceus
Training data	16 angiosperm genomes

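Model list

PlantCaduceus_l20: 20 layers, hidden size 384, 20 million parameters
PlantCaduceus_l24: 24 layers, hidden size 512, 40 million parameters
PlantCaduceus_l28: 28 layers, hidden size 768, 112 million parameters
PlantCaduceus_l32: 32 layers, hidden size 1024, 225 million parameters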
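
If you choose a model programmatically, the list above can be captured as a small lookup table. The following is a minimal sketch: the `PLANTCADUCEUS_MODELS` dict and the `pick_model` helper are hypothetical names introduced here, and the parameter budget is just one possible selection criterion.

```python
# Sizes copied from the model list; the dict layout itself is illustrative.
PLANTCADUCEUS_MODELS = {
    "PlantCaduceus_l20": {"layers": 20, "hidden_size": 384, "params_millions": 20},
    "PlantCaduceus_l24": {"layers": 24, "hidden_size": 512, "params_millions": 40},
    "PlantCaduceus_l28": {"layers": 28, "hidden_size": 768, "params_millions": 112},
    "PlantCaduceus_l32": {"layers": 32, "hidden_size": 1024, "params_millions": 225},
}


def pick_model(max_params_millions):
    """Return the name of the largest model within the parameter budget.

    Returns None when even the smallest model exceeds the budget.
    """
    candidates = [
        (spec["params_millions"], name)
        for name, spec in PLANTCADUCEUS_MODELS.items()
        if spec["params_millions"] <= max_params_millions
    ]
    return max(candidates)[1] if candidates else None


print(pick_model(50))
```

The returned name can then be combined with the organization prefix used in the basic usage example (`kuleshov-group/`) to form a model path; that the other sizes are published under the same prefix is an assumption here.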

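📄 License

This project is released under the Apache-2.0 license.

📚 Citation

If you use the models or code from this project, please cite the following paper: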

@article {Zhai2024.06.04.596709,
	author = {Zhai, Jingjing and Gokaslan, Aaron and Schiff, Yair and Berthel, Ana and Liu, Zong-Yan and Miller, Zachary R and Scheben, Armin and Stitzer, Michelle C and Romay, Cinta and Buckler, Edward S. and Kuleshov, Volodymyr},
	title = {Cross-species plant genomes modeling at single nucleotide resolution using a pre-trained DNA language model},
	elocation-id = {2024.06.04.596709},
	year = {2024},
	doi = {10.1101/2024.06.04.596709},
	URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709},
	eprint = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709.full.pdf},
	journal = {bioRxiv}
}