RNABERT Open-Source Model: Pretrained on Non-Coding RNA for Biological Analysis Applications

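Developed by multimolecule

RNABERT is a model pretrained on non-coding RNA (ncRNA) with two objectives: masked language modeling (MLM) and structural alignment learning (SAL).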

Downloads: 8,166

Released: 9/10/2024

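Model Overview

RNABERT is a BERT-style model pretrained in a self-supervised fashion on a large corpus of non-coding RNA sequences. It is mainly used for feature extraction from RNA sequences and for structural alignment.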

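Model Features

Dual-objective pretraining

Pretrained with two objectives at once: masked language modeling (MLM) and structural alignment learning (SAL).

RNA-specific model

Designed for and trained on non-coding RNA sequences.

Lightweight architecture

Only 0.48M parameters, suited to RNA sequence processing tasks.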

Model Capabilities

RNA sequence feature extraction

RNA structural alignment prediction

Masked-nucleotide prediction for RNA sequences

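Use Cases

Bioinformatics

RNA functional clustering

Cluster RNAs by function using the sequence features extracted by the model.

RNA structural alignment

Predict the structural alignment between two RNA sequences.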

🚀 RNABERT

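RNABERT is a BERT-style model pretrained in a self-supervised fashion on a large corpus of non-coding RNA (ncRNA) sequences. It was trained only on the raw nucleotides of RNA sequences, with inputs and labels generated from those sequences by an automatic process.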

🚀 Quick Start

The model files depend on the multimolecule library, which can be installed with pip:

pip install multimolecule

Direct Use

You can use this model directly for masked language modeling:

>>> import multimolecule  # importing multimolecule is required to register the model
>>> from transformers import pipeline
>>> unmasker = pipeline("fill-mask", model="multimolecule/rnabert")
>>> unmasker("gguc<mask>cucugguuagaccagaucugagccu")

[{'score': 0.03852083534002304,
  'token': 24,
  'token_str': '-',
  'sequence': 'G G U C - C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.03851056098937988,
  'token': 10,
  'token_str': 'N',
  'sequence': 'G G U C N C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.03849703073501587,
  'token': 25,
  'token_str': 'I',
  'sequence': 'G G U C I C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.03848597779870033,
  'token': 3,
  'token_str': '<unk>',
  'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.038484156131744385,
  'token': 5,
  'token_str': '<null>',
  'sequence': 'G G U C C U C U G G U U A G A C C A G A U C U G A G C C U'}]

Downstream Use

Extract Features

Here is how to use this model in PyTorch to get the features of a given sequence:

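from multimolecule import RnaTokenizer, RnaBertModel

# Load the tokenizer and the pretrained backbone
tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnabert")
model = RnaBertModel.from_pretrained("multimolecule/rnabert")

# Tokenize one RNA sequence into PyTorch tensors
text = "UAGCUUAUCAGACUGAUGUUGA"
inputs = tokenizer(text, return_tensors="pt")

# Forward pass; the output holds the extracted features
output = model(**inputs)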
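
The features returned above are per-token, while a use case such as functional clustering needs one vector per sequence. A common way to get there is masked mean pooling followed by a similarity measure. The sketch below shows the idea in plain Python on made-up toy vectors. The helper names `masked_mean_pool` and `cosine_similarity` are hypothetical and not part of multimolecule, and how the real hidden states are read from `output` is left out because the source does not specify it.

```python
import math


def masked_mean_pool(hidden_states, attention_mask):
    """Average per-token vectors, skipping positions whose mask is 0 (padding)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for per-token hidden states (3 tokens, 2 dimensions).
# The last token of the first sequence is treated as padding.
states_a = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask_a = [1, 1, 0]
states_b = [[2.0, 1.0], [2.0, 1.0], [2.0, 1.0]]
mask_b = [1, 1, 1]

vec_a = masked_mean_pool(states_a, mask_a)
vec_b = masked_mean_pool(states_b, mask_b)
similarity = cosine_similarity(vec_a, vec_b)
```

Pairwise similarities computed this way can be passed to any standard clustering routine. Whether mean pooling is the best choice for RNABERT features is not stated in the source.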

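Sequence Classification / Regression

Note: This model is not fine-tuned for any specific task. You need to fine-tune it on a downstream task before using it for sequence classification or regression. Here is how to use this model in PyTorch as a backbone for fine-tuning on a sequence-level task: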

import torch
from multimolecule import RnaTokenizer, RnaBertForSequencePrediction

# Load the tokenizer and the backbone with a sequence-level prediction head
tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnabert")
model = RnaBertForSequencePrediction.from_pretrained("multimolecule/rnabert")

text = "UAGCUUAUCAGACUGAUGUUGA"
inputs = tokenizer(text, return_tensors="pt")
# One label for the whole sequence
label = torch.tensor([1])

# Passing labels makes the model compute a loss for fine-tuning
output = model(**inputs, labels=label)

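Token Classification / Regression

Note: This model is not fine-tuned for any specific task. You need to fine-tune it on a downstream task before using it for nucleotide classification or regression. Here is how to use this model in PyTorch as a backbone for fine-tuning on a nucleotide-level task: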

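import torch
from multimolecule import RnaTokenizer, RnaBertForTokenPrediction

# Load the tokenizer and the backbone with a nucleotide-level prediction head
tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnabert")
model = RnaBertForTokenPrediction.from_pretrained("multimolecule/rnabert")

text = "UAGCUUAUCAGACUGAUGUUGA"
inputs = tokenizer(text, return_tensors="pt")
# One random binary label per nucleotide (placeholder data)
label = torch.randint(2, (len(text), ))

# Passing labels makes the model compute a loss for fine-tuning
output = model(**inputs, labels=label)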

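Contact Classification / Regression

Note: This model is not fine-tuned for any specific task. You need to fine-tune it on a downstream task before using it for contact classification or regression. Here is how to use this model in PyTorch as a backbone for fine-tuning on a contact-level task: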

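import torch
from multimolecule import RnaTokenizer, RnaBertForContactPrediction

# Load the tokenizer and the backbone with a contact-level prediction head
tokenizer = RnaTokenizer.from_pretrained("multimolecule/rnabert")
model = RnaBertForContactPrediction.from_pretrained("multimolecule/rnabert")

text = "UAGCUUAUCAGACUGAUGUUGA"
inputs = tokenizer(text, return_tensors="pt")
# One random binary label per nucleotide pair (placeholder data)
label = torch.randint(2, (len(text), len(text)))

# Passing labels makes the model compute a loss for fine-tuning
output = model(**inputs, labels=label)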

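✨ Key Features

RNABERT is a BERT-style model pretrained in a self-supervised fashion on a large corpus of non-coding RNA sequences. It has two pretraining objectives: masked language modeling (MLM) and structural alignment learning (SAL).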

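📚 Documentation

Model Details

RNABERT is a BERT-style model pretrained in a self-supervised fashion on a large corpus of non-coding RNA sequences. This means the model was trained only on the raw nucleotides of RNA sequences, with inputs and labels generated from those sequences by an automatic process. See the Training Details section for more information on the training process.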

Model Specification

Num Layers	Hidden Size	Num Heads	Intermediate Size	Num Parameters (M)	FLOPs (G)	MACs (G)	Max Num Tokens
6	120	12	40	0.48	0.15	0.08	440

Links

Code: multimolecule.rnabert
Weights: multimolecule/rnabert
Data: RNAcentral
Paper: Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning
Developed by: Manato Akiyama and Yasubumi Sakakibara
Model type: BERT
Original repository: mana438/RNABERT

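Training Details

RNABERT has two pretraining objectives: masked language modeling (MLM) and structural alignment learning (SAL).

Masked language modeling (MLM): given a sequence, the model randomly masks 15% of the tokens in the input, runs the entire masked sequence through the model, and has to predict the masked tokens. This is comparable to the Cloze task in language modeling.
Structural alignment learning (SAL): the model learns to predict the structural alignment of two RNA sequences. It is trained to predict the alignment score of two RNA sequences using the Needleman-Wunsch algorithm.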
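
For readers unfamiliar with the Needleman-Wunsch algorithm named above, the following is a minimal sketch of its global alignment score as a dynamic program. The fixed `match`, `mismatch`, and `gap` values are illustrative assumptions. According to the cited abstract, RNABERT combines the algorithm with its learned base embeddings, so this is not the scoring used in SAL training.

```python
def needleman_wunsch_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of two sequences.

    The scoring parameters are placeholder values chosen for illustration.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell takes the best of: substitution, gap in seq_b, gap in seq_a.
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    return score[-1][-1]


identical = needleman_wunsch_score("GGUC", "GGUC")
one_gap = needleman_wunsch_score("GGUC", "GGC")
```

The table has one cell per pair of positions, so filling it takes time proportional to the product of the two sequence lengths.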

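Training Data

The RNABERT model was pretrained on RNAcentral. RNAcentral is a free, public resource that offers integrated access to a comprehensive and up-to-date set of non-coding RNA sequences provided by a collaborating group of Expert Databases representing a broad range of organisms and RNA types. RNABERT used a subset of 76,237 human ncRNA sequences from RNAcentral for pretraining. RNABERT preprocessed all tokens by replacing “U” with “T”. Note that during model conversion, “T” is replaced with “U”. RnaTokenizer (multimolecule.RnaTokenizer) converts “T” to “U” for you; you can disable this behavior by passing replace_T_with_U=False.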
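
The “T” to “U” conversion described above amounts to a simple string normalization. The sketch below illustrates the effect with a hypothetical helper, `normalize_rna`. It is not the tokenizer's implementation, and the upper-casing step is an assumption based on the lower-case input and upper-case output in the fill-mask example.

```python
def normalize_rna(sequence, replace_t_with_u=True):
    """Upper-case a nucleotide string and optionally map T to U.

    Hypothetical helper for illustration; RnaTokenizer performs its own
    conversion internally.
    """
    sequence = sequence.upper()
    if replace_t_with_u:
        sequence = sequence.replace("T", "U")
    return sequence


converted = normalize_rna("ggtcatc")
untouched = normalize_rna("GGTC", replace_t_with_u=False)
```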

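Training Procedure

Preprocessing

RNABERT preprocesses the dataset by applying 10 different mask patterns to the 72,237 human ncRNA sequences. The final dataset contains 722,370 sequences. The masking procedure is similar to the one used in BERT:

15% of the tokens are masked.
In 80% of cases, the masked tokens are replaced by <mask>.
In 10% of cases, the masked tokens are replaced by a random token different from the one they replace.
In the remaining 10% of cases, the masked tokens are left unchanged.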
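
The masking rules above can be sketched as follows. This is an illustration under stated assumptions, not the original preprocessing code. It selects a fixed count of positions (15% of the length, rounded, at least one) and draws random replacements only from A, C, G, and U. The names `mask_tokens` and `build_patterns` are hypothetical.

```python
import random

NUCLEOTIDES = ["A", "C", "G", "U"]  # assumed replacement vocabulary
MASK = "<mask>"


def mask_tokens(tokens, rng, mask_rate=0.15):
    """Return (corrupted, labels); labels are None where nothing is predicted."""
    if not tokens:
        return [], []
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    n_select = max(1, round(len(tokens) * mask_rate))
    for pos in rng.sample(range(len(tokens)), n_select):
        labels[pos] = tokens[pos]
        roll = rng.random()
        if roll < 0.8:
            # 80%: replace with the mask token
            corrupted[pos] = MASK
        elif roll < 0.9:
            # 10%: replace with a different random nucleotide
            others = [n for n in NUCLEOTIDES if n != tokens[pos]]
            corrupted[pos] = rng.choice(others)
        # remaining 10%: keep the original token
    return corrupted, labels


def build_patterns(tokens, n_patterns=10, seed=0):
    """Apply several independent mask patterns to one sequence."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng) for _ in range(n_patterns)]


sequence = list("UAGCUUAUCAGACUGAUGUUGA")
patterns = build_patterns(sequence)

# 10 patterns per sequence over 72,237 sequences gives the stated dataset size.
dataset_size = 72_237 * 10
```

Each source sequence yields 10 training examples, which matches the 722,370 figure: 72,237 × 10 = 722,370.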

Pretraining

The model was trained on a single NVIDIA V100 GPU.

Disclaimer

This is an unofficial implementation of Informative RNA base embedding for functional RNA clustering and structural alignment by Manato Akiyama and Yasubumi Sakakibara. The official RNABERT repository is at mana438/RNABERT.

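⚠️ Important Note

The MultiMolecule team is aware of a potential risk in reproducing the results of RNABERT.

The original implementation of RNABERT does not prepend a <cls> token to the input sequence or append an <eos> token to it. In most cases this should not affect the model's performance, but in some cases it can lead to unexpected behavior.

If you want exactly the same behavior as the original implementation, explicitly set cls_token=None and eos_token=None in the tokenizer.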
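
To make the difference concrete, the toy sketch below shows how the two special tokens change the token sequence the model sees. `add_special_tokens` is a hypothetical stand-in written for illustration and is not the RnaTokenizer code.

```python
def add_special_tokens(nucleotides, cls_token="<cls>", eos_token="<eos>"):
    """Wrap a nucleotide string in optional <cls>/<eos> tokens (toy example)."""
    tokens = list(nucleotides)
    if cls_token is not None:
        tokens.insert(0, cls_token)
    if eos_token is not None:
        tokens.append(eos_token)
    return tokens


with_special = add_special_tokens("GGUC")
without_special = add_special_tokens("GGUC", cls_token=None, eos_token=None)
```

With the special tokens the sequence is two tokens longer and every nucleotide shifts one position to the right, which is one way position-dependent representations can differ from those of the original implementation.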

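💡 Usage Tip

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.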

Citation

BibTeX:

@article{akiyama2022informative,
    author = {Akiyama, Manato and Sakakibara, Yasubumi},
    title = "{Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning}",
    journal = {NAR Genomics and Bioinformatics},
    volume = {4},
    number = {1},
    pages = {lqac012},
    year = {2022},
    month = {02},
    abstract = "{Effective embedding is actively conducted by applying deep learning to biomolecular information. Obtaining better embeddings enhances the quality of downstream analyses, such as DNA sequence motif detection and protein function prediction. In this study, we adopt a pre-training algorithm for the effective embedding of RNA bases to acquire semantically rich representations and apply this algorithm to two fundamental RNA sequence problems: structural alignment and clustering. By using the pre-training algorithm to embed the four bases of RNA in a position-dependent manner using a large number of RNA sequences from various RNA families, a context-sensitive embedding representation is obtained. As a result, not only base information but also secondary structure and context information of RNA sequences are embedded for each base. We call this ‘informative base embedding’ and use it to achieve accuracies superior to those of existing state-of-the-art methods on RNA structural alignment and RNA family clustering tasks. Furthermore, upon performing RNA sequence alignment by combining this informative base embedding with a simple Needleman–Wunsch alignment algorithm, we succeed in calculating structural alignments with a time complexity of O(n^2) instead of the O(n^6) time complexity of the naive implementation of Sankoff-style algorithm for input RNA sequence of length n.}",
    issn = {2631-9268},
    doi = {10.1093/nargab/lqac012},
    url = {https://doi.org/10.1093/nargab/lqac012},
    eprint = {https://academic.oup.com/nargab/article-pdf/4/1/lqac012/42577168/lqac012.pdf},
}

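Contact

If you have questions or comments about this model card, please use MultiMolecule's GitHub issues. If you have questions or comments about the paper or the model, please contact the authors of the RNABERT paper.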

📄 License

This model is licensed under the AGPL-3.0 License.

SPDX-License-Identifier: AGPL-3.0-or-later
