GENA-LM Open-Source Models: Long DNA Sequence Analysis, Free to Deploy for Genomics Research

Gena Lm Bert Large T2t

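Developed by AIRI-Institute

GENA-LM is a family of open-source foundation models for long DNA sequences: Transformer masked language models trained on human DNA sequences.

Molecular model

Transformers

Other #long DNA sequence modeling #BPE tokenization #genome annotation

Downloads: 386

Released: 4/2/2023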

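Model Overview

The GENA-LM models are Transformer masked language models trained on human DNA sequences and designed specifically for long DNA sequences.

Model Features

Long-sequence handling

Input length is about 4500 nucleotides (512 BPE tokens), compared with 512 nucleotides for DNABERT.

BPE tokenization

Uses BPE tokenization instead of k-mer tokenization, so the same 512-token window covers a much longer stretch of sequence.

T2T genome pretraining

Pretrained on the T2T human genome assembly rather than GRCh38.p13.

Pretraining data augmentation

Training data was augmented by sampling mutations from 1000-genome SNPs (gnomAD dataset).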

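Model Capabilities

DNA sequence analysis

Promoter prediction

Splice site prediction

Genomic sequence annotation

Use Cases

Genomics research

300bp promoter prediction

Predicts promoter regions in 300bp DNA sequences.

See the paper for performance metrics.

2000bp promoter prediction

Predicts promoter regions in 2000bp DNA sequences.

See the paper for performance metrics.

Splice site prediction

Predicts splice sites in DNA sequences.

See the paper for performance metrics.

Genomic sequence annotation tool

GENA-Web application

Used in the GENA-Web genomic sequence annotation tool.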

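🚀 GENA-LM (gena-lm-bert-large-t2t)

GENA-LM is a family of open-source foundation models for long DNA sequences. The models are based on the Transformer architecture and trained on human DNA sequences, and are intended as a basis for DNA sequence analysis.

Project Links

Source code and data: https://github.com/AIRI-Institute/GENA_LM
Paper: https://academic.oup.com/nar/article/53/2/gkae1310/7954523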

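🚀 Quick Start

Model Differences

The main differences between GENA-LM (gena-lm-bert-large-t2t) and DNABERT are:

BPE tokenization instead of k-mers;
an input sequence size of about 4500 nucleotides (512 BPE tokens), compared with 512 nucleotides for DNABERT;
pretraining on the T2T human genome assembly instead of GRCh38.p13.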
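
To make the context-length comparison concrete, here is a rough back-of-the-envelope sketch. It assumes DNABERT-style overlapping k-mers with a stride of one nucleotide (k = 6 is chosen purely for illustration) and takes the ~4500-nucleotide figure above at face value. The helper names are hypothetical.

```python
# Rough comparison of how many nucleotides a 512-token window covers.
# Assumption: k-mer tokenization slides one nucleotide at a time.

def kmer_tokens(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA string into overlapping k-mers with stride 1."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_coverage(num_tokens: int, k: int = 6) -> int:
    """Nucleotides spanned by `num_tokens` overlapping k-mers."""
    return num_tokens + k - 1


MAX_TOKENS = 512
BPE_NUCLEOTIDES = 4500  # approximate figure quoted for GENA-LM

print(kmer_tokens("ACGTACGTAC", k=6))          # five overlapping 6-mers
print(kmer_coverage(MAX_TOKENS, k=6))          # 517
print(round(BPE_NUCLEOTIDES / MAX_TOKENS, 1))  # 8.8 nucleotides per BPE token on average
```

Because overlapping k-mers advance one nucleotide per token, the window length in nucleotides is close to the token count, whereas each BPE token covers several nucleotides on average.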

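Models Fine-Tuned on Downstream Tasks

This repository also contains models fine-tuned on downstream tasks:

300bp promoter prediction (branch promoters_300_run_1)
2000bp promoter prediction (branch promoters_2000_run_1)
splice site prediction (branch spliceai_run_1)

and the model used in our GENA-Web web tool for genomic sequence annotation:

2000bp promoter prediction (branch gena_web_promoters_2000)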
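
The fine-tuned checkpoints live on branches of the same model repository. A minimal sketch of selecting one, assuming the standard `revision` argument of `from_pretrained` resolves these branch names. The `FINETUNED_BRANCHES` mapping, its task keys, and the `load_finetuned` helper are hypothetical names introduced for illustration.

```python
MODEL_ID = "AIRI-Institute/gena-lm-bert-large-t2t"

# Task keys are illustrative; branch names are the ones listed above.
FINETUNED_BRANCHES = {
    "promoters_300": "promoters_300_run_1",
    "promoters_2000": "promoters_2000_run_1",
    "splice_sites": "spliceai_run_1",
    "gena_web_promoters_2000": "gena_web_promoters_2000",
}


def load_finetuned(model_cls, task: str, **kwargs):
    """Load `model_cls` weights from the branch that holds the given task."""
    if task not in FINETUNED_BRANCHES:
        raise KeyError(
            f"unknown task {task!r}; expected one of {sorted(FINETUNED_BRANCHES)}"
        )
    # `revision` selects a branch, tag, or commit of the model repository.
    return model_cls.from_pretrained(
        MODEL_ID, revision=FINETUNED_BRANCHES[task], **kwargs
    )
```

Here `model_cls` would be a task-specific class such as the GENA-LM `BertForSequenceClassification`; extra keyword arguments are passed through to `from_pretrained` unchanged.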

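💻 Usage Examples

Basic Usage

How to load the pretrained model for masked language modeling

from transformers import AutoTokenizer, AutoModel

# trust_remote_code=True lets transformers run the model code shipped in the model repository
tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t')
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t', trust_remote_code=True)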
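
Sequences that tokenize to more than 512 tokens will not fit in a single forward pass. One common workaround, sketched here under assumptions, is to split the token IDs into overlapping windows. The function name, the stride, and the choice to reserve two positions for special tokens such as [CLS] and [SEP] are illustrative assumptions, not part of the GENA-LM code.

```python
def split_into_windows(token_ids, max_len=512, stride=256, num_special=2):
    """Split a list of token IDs into overlapping windows that fit the model."""
    window = max_len - num_special  # positions left after special tokens
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need 0 < stride <= max_len - num_special")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the sequence
        start += stride
    return windows


ids = list(range(1000))  # stand-in for real token IDs
print([len(w) for w in split_into_windows(ids)])  # [510, 510, 488]
```

A smaller stride gives more overlap between neighbouring windows at the cost of more forward passes.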

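Advanced Usage

How to load the pretrained model for fine-tuning on a classification task

Method 1: get the model class from the GENA-LM repository

# Shell: clone the repository into your working directory
git clone https://github.com/AIRI-Institute/GENA_LM.git

# Python: run from the directory that contains the GENA_LM clone
from GENA_LM.src.gena_lm.modeling_bert import BertForSequenceClassification
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t')
model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t')

Alternatively, you can download modeling_bert.py directly and place it next to your code.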

Method 2: get the model class via HuggingFace AutoModel

import importlib

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t')

# Loading with trust_remote_code=True downloads and imports the GENA-LM model code
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t', trust_remote_code=True)
gena_module_name = model.__class__.__module__
print(gena_module_name)

# Available class names:
# - BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
# - BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
# - BertForQuestionAnswering
# See https://huggingface.co/docs/transformers/model_doc/bert
cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
print(cls)

# Reload the same weights into the classification class with a two-label head
model = cls.from_pretrained('AIRI-Institute/gena-lm-bert-large-t2t', num_labels=2)

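📚 Detailed Documentation

Model Description

The GENA-LM (gena-lm-bert-large-t2t) model is trained as a masked language model (MLM), following the method proposed in the BigBird paper, with 15% of tokens masked. The configuration of gena-lm-bert-large-t2t is similar to bert-large-uncased:

Maximum sequence length: 512
Layers: 24
Attention heads: 16
Hidden size: 1024
Vocabulary size: 32k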
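
As a rough illustration of the 15% masking objective, the following sketch masks a random subset of token positions. It is a simplified stand-in, not the authors' training code: every selected position is replaced with a mask ID, special tokens are not treated separately, and `mask_id`, the `ignore_index` value of -100, and the function name are assumptions made for the example.

```python
import random


def mask_tokens(token_ids, mask_id, mask_prob=0.15, ignore_index=-100, seed=None):
    """Return (masked_inputs, labels) for one masked-language-modeling example."""
    if not token_ids:
        return [], []
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(token_ids) * mask_prob))
    positions = set(rng.sample(range(len(token_ids)), num_to_mask))
    # Inputs: selected positions are replaced by the mask ID.
    inputs = [mask_id if i in positions else t for i, t in enumerate(token_ids)]
    # Labels: original ID at masked positions, ignore_index everywhere else.
    labels = [t if i in positions else ignore_index for i, t in enumerate(token_ids)]
    return inputs, labels


ids = list(range(100, 612))  # 512 stand-in token IDs
inputs, labels = mask_tokens(ids, mask_id=4, seed=0)
print(inputs.count(4))  # 77 positions masked out of 512
```

The model is then trained to predict the original token at each masked position; positions labeled with `ignore_index` do not contribute to the loss.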

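We pretrained gena-lm-bert-large-t2t on the latest T2T human genome assembly (https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.3/). The data was augmented by sampling mutations from 1000-genome SNPs (gnomAD dataset). Pretraining ran for 1,750,000 iterations with a batch size of 256 and a sequence length of 512 tokens. We modified the Transformer to use Pre-Layer normalization.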
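
The training schedule implies an upper bound on the number of tokens processed during pretraining. The arithmetic below assumes every sequence in every batch is a full 512 tokens; the nucleotide estimate additionally assumes the ratio of about 4500 nucleotides per 512 tokens quoted for this model.

```python
iterations = 1_750_000
batch_size = 256
seq_len = 512

tokens_seen = iterations * batch_size * seq_len
nucleotides_seen = iterations * batch_size * 4500  # ~4500 nt per 512-token sequence

print(f"{tokens_seen:,} tokens")            # 229,376,000,000 tokens
print(f"{nucleotides_seen:,} nucleotides")  # 2,016,000,000,000 nucleotides
```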

Evaluation

For evaluation results, see our paper: https://academic.oup.com/nar/article/53/2/gkae1310/7954523

📄 License

Citation

@article{GENA_LM,
    author = {Fishman, Veniamin and Kuratov, Yuri and Shmelev, Aleksei and Petrov, Maxim and Penzar, Dmitry and Shepelin, Denis and Chekanov, Nikolay and Kardymon, Olga and Burtsev, Mikhail},
    title = {GENA-LM: a family of open-source foundational DNA language models for long sequences},
    journal = {Nucleic Acids Research},
    volume = {53},
    number = {2},
    pages = {gkae1310},
    year = {2025},
    month = {01},
    issn = {0305-1048},
    doi = {10.1093/nar/gkae1310},
    url = {https://doi.org/10.1093/nar/gkae1310},
    eprint = {https://academic.oup.com/nar/article-pdf/53/2/gkae1310/61443229/gkae1310.pdf},
}