RNABERT: an open-source pre-trained model for non-coding RNA analysis

Developed by multimolecule

RNABERT is a model pre-trained on non-coding RNA (ncRNA) with masked language modeling (MLM) and structural alignment learning (SAL) objectives.

Downloads: 8,166

Released: 9/10/2024

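Model overview

RNABERT is a BERT-style model pre-trained in a self-supervised fashion on a large corpus of non-coding RNA sequences. It is mainly used for feature extraction from RNA sequences and for structural alignment.

Model features

Dual pre-training objectives

Combines two pre-training objectives: masked language modeling (MLM) and structural alignment learning (SAL).

RNA-specific model

Designed for and trained on non-coding RNA sequences.

Lightweight architecture

Only 0.48M parameters, sized for RNA sequence processing tasks.

Model capabilities

RNA sequence feature extraction

RNA structural alignment prediction

Masked-token prediction for RNA sequences

Use cases

Bioinformatics

RNA functional clustering

Cluster RNAs by function using the sequence features extracted by the model.

RNA structural alignment

Predict the structural alignment between two RNA sequences.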

🚀 RNABERT

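RNABERT is a BERT-style model pre-trained in a self-supervised fashion on a large corpus of non-coding RNA (ncRNA) sequences. The model was trained only on the raw nucleotides of RNA sequences, with an automatic process generating the inputs and labels from those sequences.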

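🚀 Quick start

The model files depend on the multimolecule library, which you can install with pip:

pip install multimolecule

Direct use

You can use this model directly for masked language modeling:

>>> import multimolecule  # you must import multimolecule to register the models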
>>> from transformers import pipeline
>>> unmasker = pipeline("fill-mask", model="multimolecule/rnabert")
>>> unmasker("gguc<mask>cucugguuagaccagaucugagccu")

[{'score': 0.03852083534002304,
  'token': 24,
  'token_str': '-',
  'sequence': 'G G U C - C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.03851056098937988,
  'token': 10,
  'token_str': 'N',
  'sequence': 'G G U C N C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.03849703073501587,
  'token': 25,
  'token_str': 'I',
  'sequence': 'G G U C I C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.03848597779870033,
  'token': 3,
  'token_str': '<unk>',
  'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.038484156131744385,
  'token': 5,
  'token_str': '<null>',
  'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'}]

Downstream use

Extract features

Here is how to use this model in PyTorch to get the features of a given sequence:

from multimolecule import RnaTokenizer, RnaBertModel

tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnabert")
model = RnaBertModel.from_pretrained("multimolecule/rnabert")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")

output = model(**input)

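Sequence classification / regression

Note: this model has not been fine-tuned for any specific task. To use it for sequence classification or regression, you need to fine-tune it on a downstream task. Here is how to use this model in PyTorch as a backbone for fine-tuning on a sequence-level task: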

import torch
from multimolecule import RnaTokenizer, RnaBertForSequencePrediction

tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnabert")
model = RnaBertForSequencePrediction.from_pretrained("multimolecule/rnabert")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")
label = torch.tensor([1])

output = model(**input, labels=label)

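Token classification / regression

Note: this model has not been fine-tuned for any specific task. To use it for nucleotide classification or regression, you need to fine-tune it on a downstream task. Here is how to use this model in PyTorch as a backbone for fine-tuning on a nucleotide-level task: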

import torch
from multimolecule import RnaTokenizer, RnaBertForTokenPrediction

tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnabert")
model = RnaBertForTokenPrediction.from_pretrained("multimolecule/rnabert")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), ))

output = model(**input, labels=label)

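Contact classification / regression

Note: this model has not been fine-tuned for any specific task. To use it for contact classification or regression, you need to fine-tune it on a downstream task. Here is how to use this model in PyTorch as a backbone for fine-tuning on a contact-level task: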

import torch
from multimolecule import RnaTokenizer, RnaBertForContactPrediction

tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnabert")
model = RnaBertForContactPrediction.from_pretrained("multimolecule/rnabert")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), len(text)))

output = model(**input, labels=label)

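✨ Key features

RNABERT is a BERT-style model pre-trained in a self-supervised fashion on a large corpus of non-coding RNA sequences. The model has two pre-training objectives: masked language modeling (MLM) and structural alignment learning (SAL).

📚 Documentation

Model details

RNABERT is a BERT-style model pre-trained in a self-supervised fashion on a large corpus of non-coding RNA sequences. This means the model was trained only on the raw nucleotides of RNA sequences, with an automatic process generating the inputs and labels from those sequences. See the Training details section for more information on the training process.

Model specification

Num layers	Hidden size	Num heads	Intermediate size	Num parameters (M)	FLOPs (G)	MACs (G)	Max num tokens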
6	120	12	40	0.48	0.15	0.08	440
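
As a rough sanity check on the table, the parameter count can be estimated from the listed sizes. The sketch below assumes a standard BERT layout (learned position embeddings, two token types, biases on every linear layer, a pooler) and a 26-token vocabulary; none of these are stated in this card, and the vocabulary size is only inferred from the fill-mask output above, where token ids run up to 25. The helper names are hypothetical.

```python
# Rough parameter estimate for a BERT-style encoder using the sizes in the table.
# ASSUMPTIONS (not from the model card): vocab of 26 tokens, 2 token types,
# learned position embeddings, biased linear layers, and a pooler layer.

def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(size: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * size


def estimate_parameters(layers=6, hidden=120, intermediate=40,
                        max_tokens=440, vocab=26, token_types=2) -> dict:
    embeddings = (vocab + max_tokens + token_types) * hidden + layer_norm(hidden)
    # Query, key, value and output projections, followed by a LayerNorm.
    attention = 4 * linear(hidden, hidden) + layer_norm(hidden)
    feed_forward = (linear(hidden, intermediate)
                    + linear(intermediate, hidden)
                    + layer_norm(hidden))
    encoder = layers * (attention + feed_forward)
    pooler = linear(hidden, hidden)
    return {
        "embeddings": embeddings,
        "encoder": encoder,
        "pooler": pooler,
        "total": embeddings + encoder + pooler,
    }


estimate = estimate_parameters()
print(estimate["total"] / 1e6)  # estimated parameters, in millions
```

Under these assumptions the estimate is consistent with the 0.48M reported in the table, but that agreement does not confirm the assumed layout matches the actual implementation. With 12 heads and a hidden size of 120, each attention head works on 10 dimensions.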

Links

Code: multimolecule.rnabert
Weights: multimolecule/rnabert
Data: RNAcentral
Paper: Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning
Developed by: Manato Akiyama and Yasubumi Sakakibara
Model type: BERT
Original repository: mana438/RNABERT

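Training details

RNABERT has two pre-training objectives: masked language modeling (MLM) and structural alignment learning (SAL).

Masked language modeling (MLM): given a sequence, the model randomly masks 15% of the input tokens, runs the entire masked sequence through the model, and predicts the masked tokens. This is comparable to the cloze task in language modeling.
Structural alignment learning (SAL): the model learns to predict the structural alignment of two RNA sequences. It is trained to predict the alignment score of two RNA sequences using the Needleman-Wunsch algorithm.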
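
For readers unfamiliar with Needleman-Wunsch, the following is a minimal sketch of the global alignment score it computes. It is not RNABERT's training code: per the paper abstract, RNABERT combines the algorithm with its learned base embeddings, whereas this sketch scores plain nucleotide identity. The match, mismatch and gap values are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score of sequences a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            deletion = prev[j] + gap       # a[i-1] aligned to a gap
            insertion = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGU", "ACU"))
```

The table is filled row by row, so time is O(len(a) * len(b)) and only two rows are kept in memory.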

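Training data

RNABERT was pre-trained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a group of expert databases representing a broad range of organisms and RNA types. RNABERT used a subset of 76,237 human ncRNA sequences from RNAcentral for pre-training. RNABERT preprocessed all tokens by replacing "U" with "T". Note that during model conversion, "T" is replaced with "U". RnaTokenizer (multimolecule.RnaTokenizer) converts "T" to "U" for you; you can disable this behavior by passing replace_T_with_U=False.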
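
The T-to-U conversion described above amounts to a simple string normalization. The sketch below uses a hypothetical helper, to_rna, to show the effect on a DNA-alphabet input; it is not RnaTokenizer's implementation. The upper-casing step is an assumption based on the fill-mask example earlier in this card, where a lower-case input is echoed back in upper case.

```python
def to_rna(sequence: str, replace_t_with_u: bool = True) -> str:
    """Hypothetical helper: normalize a nucleotide string to the RNA alphabet."""
    normalized = sequence.upper()  # assumption: case is not significant
    if replace_t_with_u:
        normalized = normalized.replace("T", "U")
    return normalized


print(to_rna("tagcttatcagactgatgttga"))           # T replaced with U
print(to_rna("TAGCTTATC", replace_t_with_u=False))  # left in the DNA alphabet
```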

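Training procedure

Preprocessing

RNABERT preprocessed the dataset by applying 10 different mask patterns to the 72,237 human ncRNA sequences. The final dataset contains 722,370 sequences. The masking procedure is similar to the one used in BERT:

15% of the tokens are masked.
In 80% of the cases, the masked tokens are replaced by <mask>.
In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.
In the remaining 10% of cases, the masked tokens are left unchanged.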
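
The masking rules above can be sketched in plain Python. This is an illustration under stated assumptions, not the original preprocessing script: each token is selected independently with probability 0.15 (so 15% holds in expectation), random replacements are drawn from the four nucleotides only, and the function names and seed are hypothetical.

```python
import random

NUCLEOTIDES = ["A", "C", "G", "U"]  # assumption: replacements come from these four
MASK = "<mask>"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return (masked_tokens, labels); a label is None where nothing is predicted."""
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # token not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK  # 80%: replace with the mask token
        elif roll < 0.9:
            # 10%: replace with a random token different from the original
            masked[i] = rng.choice([n for n in NUCLEOTIDES if n != token])
        # remaining 10%: keep the original token
    return masked, labels


def mask_patterns(sequence, n_patterns=10, seed=0):
    """Generate several independent mask patterns for one sequence."""
    rng = random.Random(seed)
    return [mask_tokens(list(sequence), rng) for _ in range(n_patterns)]


patterns = mask_patterns("UAGCUUAUCAGACUGAUGUUGA")
print(len(patterns))  # one entry per mask pattern
```

Generating 10 patterns per sequence is what turns 72,237 sequences into 722,370 training examples.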

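Pre-training

The model was trained on 1 NVIDIA V100 GPU.

Disclaimer

This is an unofficial implementation of Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning by Manato Akiyama and Yasubumi Sakakibara. The official repository of RNABERT is at mana438/RNABERT.

⚠️ Important note

The MultiMolecule team is aware of a potential risk in reproducing the results of RNABERT.

The original implementation of RNABERT does not prepend a <cls> token or append an <eos> token to the input sequence. In most cases this should not affect the model's performance, but in some cases it can lead to unexpected behavior.

To reproduce the behavior of the original implementation exactly, explicitly set cls_token=None and eos_token=None in the tokenizer.

💡 Usage tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.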

Citation

BibTeX:

@article{akiyama2022informative,
    author = {Akiyama, Manato and Sakakibara, Yasubumi},
    title = "{Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning}",
    journal = {NAR Genomics and Bioinformatics},
    volume = {4},
    number = {1},
    pages = {lqac012},
    year = {2022},
    month = {02},
    abstract = "{Effective embedding is actively conducted by applying deep learning to biomolecular information. Obtaining better embeddings enhances the quality of downstream analyses, such as DNA sequence motif detection and protein function prediction. In this study, we adopt a pre-training algorithm for the effective embedding of RNA bases to acquire semantically rich representations and apply this algorithm to two fundamental RNA sequence problems: structural alignment and clustering. By using the pre-training algorithm to embed the four bases of RNA in a position-dependent manner using a large number of RNA sequences from various RNA families, a context-sensitive embedding representation is obtained. As a result, not only base information but also secondary structure and context information of RNA sequences are embedded for each base. We call this ‘informative base embedding’ and use it to achieve accuracies superior to those of existing state-of-the-art methods on RNA structural alignment and RNA family clustering tasks. Furthermore, upon performing RNA sequence alignment by combining this informative base embedding with a simple Needleman–Wunsch alignment algorithm, we succeed in calculating structural alignments with a time complexity of O(n^2) instead of the O(n^6) time complexity of the naive implementation of Sankoff-style algorithm for input RNA sequence of length n.}",
    issn = {2631-9268},
    doi = {10.1093/nargab/lqac012},
    url = {https://doi.org/10.1093/nargab/lqac012},
    eprint = {https://academic.oup.com/nargab/article-pdf/4/1/lqac012/42577168/lqac012.pdf},
}

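Contact

For questions or comments about this model card, please use the GitHub issues of MultiMolecule. For questions or comments about the paper or the model, please contact the authors of the RNABERT paper.

📄 License

This model is licensed under the AGPL-3.0 License.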

SPDX-License-Identifier: AGPL-3.0-or-later
