Open-source rna_torsionBERT model: free, accurate prediction of torsional and pseudo-torsional angles from RNA sequences

Rna Torsionbert

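Developed by sayby

A BERT-based model that predicts torsional and pseudo-torsional angles from RNA sequences

Protein models

Transformers

License: other #RNA structure prediction #Torsional angle calculation #Biomolecular modeling

Downloads: 20.86k

Released: 1/24/2024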

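Model overview

This model is a DNABERT variant pre-trained on about 4,200 RNA structures and built specifically to predict torsional and pseudo-torsional angles from RNA sequences. On the test set, it achieves a better MCQ (mean of circular quantities) score than the existing state-of-the-art models and than angles inferred from traditional methods.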

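Model features

High-accuracy torsional angle prediction

Outperforms the existing state-of-the-art models on the RNA-Puzzles and CASP-RNA test sets

Long-sequence support

Supports sequences of up to 512 nucleotides

Multi-angle prediction

Predicts 16 different torsional and pseudo-torsional angles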

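Model capabilities

RNA sequence analysis

Torsional angle prediction

Pseudo-torsional angle prediction

Use cases

RNA structure research

RNA tertiary structure prediction

Use the predicted torsional angles to assist RNA tertiary structure modeling

Improve the accuracy of RNA structure prediction

RNA function analysis

Use torsional angle information to analyze the functional properties of RNA molecules

Help clarify the relationship between RNA structure and function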

🚀 `RNA-TorsionBERT`

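RNA-TorsionBERT is a BERT-based language model, 86.9 MB in size, that predicts torsional and pseudo-torsional angles from an RNA sequence. It was pre-trained on about 4,200 RNA structures and is intended to support RNA structure research.

🚀 Quick start

Use the following code snippet to start predicting angles with RNA-TorsionBERT: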

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sayby/rna_torsionbert", trust_remote_code=True)
model = AutoModel.from_pretrained("sayby/rna_torsionbert", trust_remote_code=True)

# Overlapping 3-mers of the sequence ACGGTT (U already replaced by T)
sequence = "ACG CGG GGT GTT"
params_tokenizer = {
    "return_tensors": "pt",
    "padding": "max_length",
    "max_length": 512,
    "truncation": True,
}
inputs = tokenizer(sequence, **params_tokenizer)
output = model(inputs)["logits"]

⚠️ Important note

The model was fine-tuned from the DNABERT-3 model, so it uses the same tokenizer as DNABERT. Replace every U nucleotide in the input sequence with T.
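
As a minimal sketch of this preprocessing step, the snippet below replaces U with T and splits a raw sequence into overlapping 3-mers separated by spaces, which is the input form used in the quick start. The function name `to_kmers` is hypothetical and chosen for illustration only; it is not part of the model's API.

```python
def to_kmers(sequence: str, k: int = 3) -> str:
    """Uppercase the sequence, replace U with T, and split it into overlapping k-mers."""
    dna_like = sequence.upper().replace("U", "T")
    # A sequence of length L yields L - k + 1 overlapping k-mers.
    return " ".join(dna_like[i : i + k] for i in range(len(dna_like) - k + 1))


tokens = to_kmers("ACGGUU")
print(tokens)  # ACG CGG GGT GTT
```

With k = 3, a sequence of L nucleotides produces L - 2 tokens, so the six-nucleotide input above gives four 3-mers.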

The output contains the sine and cosine of each angle. The angles are ordered as follows: alpha, beta, gamma, delta, epsilon, zeta, chi, eta, theta, eta', theta', v0, v1, v2, v3, v4.
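
To illustrate why a sine and cosine pair is enough to recover an angle, here is a small sketch on a single made-up value (not a model output). It assumes the pair is stored as cosine first and sine second, which is how the helper class in this document reads the output; the two-argument arctangent keeps the quadrant, whereas dividing sine by cosine loses it.

```python
import numpy as np

true_angle = -150.0  # degrees, an arbitrary illustrative value

# Assumed layout for one angle: [cosine, sine]
pair = np.array([np.cos(np.radians(true_angle)), np.sin(np.radians(true_angle))])
cos_value, sin_value = pair[0], pair[1]

# arctan of the ratio cannot tell opposite quadrants apart
naive = np.degrees(np.arctan(sin_value / cos_value))

# arctan2 uses the signs of both values and recovers the original angle
decoded = np.degrees(np.arctan2(sin_value, cos_value))

print(round(naive, 1), round(decoded, 1))  # 30.0 -150.0
```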

To convert the predictions to angles, use the following code snippet:

import transformers
from transformers import AutoModel, AutoTokenizer
import numpy as np
import pandas as pd
from typing import Optional, Dict
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

transformers.logging.set_verbosity_error()


BACKBONE = [
    "alpha",
    "beta",
    "gamma",
    "delta",
    "epsilon",
    "zeta",
    "chi",
    "eta",
    "theta",
    "eta'",
    "theta'",
    "v0",
    "v1",
    "v2",
    "v3",
    "v4",
]


class RNATorsionBERTHelper:
    def __init__(self):
        self.model_name = "sayby/rna_torsionbert"
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_name, trust_remote_code=True
        )
        self.params_tokenizer = {
            "return_tensors": "pt",
            "padding": "max_length",
            "max_length": 512,
            "truncation": True,
        }
        self.model = AutoModel.from_pretrained(self.model_name, trust_remote_code=True)

    def predict(self, sequence: str):
        sequence_tok = self.convert_raw_sequence_to_k_mers(sequence)
        inputs = self.tokenizer(sequence_tok, **self.params_tokenizer)
        outputs = self.model(inputs)["logits"]
        outputs = self.convert_sin_cos_to_angles(
            outputs.detach().cpu().numpy(), inputs["input_ids"].cpu().numpy()
        )
        output_angles = self.convert_logits_to_dict(
            outputs[0, :], inputs["input_ids"][0, :].cpu().detach().numpy()
        )
        output_angles.index = list(sequence)[:-2]  # Because of the 3-mer representation
        return output_angles

    def convert_raw_sequence_to_k_mers(self, sequence: str, k_mers: int = 3):
        """
        Convert a raw RNA sequence into a string the tokenizer can read.
        It replaces U with T and splits the sequence into overlapping k-mers.
        :param sequence: raw RNA sequence, e.g. "ACGGUU"
        :param k_mers: size of each k-mer
        :return: space-separated k-mers readable by the tokenizer
        """
        sequence = sequence.upper().replace("U", "T")
        k_mers_sequence = [
            sequence[i : i + k_mers]
            for i in range(len(sequence))
            if len(sequence[i : i + k_mers]) == k_mers
        ]
        return " ".join(k_mers_sequence)

    def convert_sin_cos_to_angles(
        self, output: np.ndarray, input_ids: Optional[np.ndarray] = None
    ):
        """
        Convert the raw predictions of RNA-TorsionBERT into angles.
        Each angle is recovered from its predicted sine and cosine using:
            alpha = arctan2(sin(alpha), cos(alpha))
        :param output: array with the sine and cosine predicted for each angle
        :param input_ids: the input_ids given to RNA-TorsionBERT. They are used to keep only
            the tokens of the sequence and to mask the special tokens.
        :return: a np.ndarray with the angles (in degrees) for the sequence
        """
        if input_ids is not None:
            output[
                (input_ids == 0)
                | (input_ids == 2)
                | (input_ids == 3)
                | (input_ids == 4)
            ] = np.nan
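        # Interleaved layout: even positions hold cosines, odd positions hold sines.
        even_indexes = np.arange(0, output.shape[-1], 2)
        odd_indexes = np.arange(1, output.shape[-1], 2)
        sin, cos = output[:, :, odd_indexes], output[:, :, even_indexes]
        # arctan2 keeps the quadrant, so angles span -180 to 180 degrees.
        angles = np.degrees(np.arctan2(sin, cos))
        return angles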

    def convert_logits_to_dict(
        self, output: np.ndarray, input_ids: np.ndarray
    ) -> pd.DataFrame:
        """
        Convert the predicted angles into a table with one column per angle.
        It removes the special tokens and keeps only the predictions for the sequence.
        :param output: predictions from the model, already converted to angles
        :param input_ids: input ids from the tokenizer
        :return: a pd.DataFrame with one column per angle and one row per token
        """
        index_start, index_end = (
            np.where(input_ids == 2)[0][0],
            np.where(input_ids == 3)[0][0],
        )
        output_non_pad = output[index_start + 1 : index_end, :]
        output_angles = {
            angle: output_non_pad[:, angle_index]
            for angle_index, angle in enumerate(BACKBONE)
        }
        out = pd.DataFrame(output_angles)
        return out


if __name__ == "__main__":
    sequence = "AGGGCUUUAGUCUUUGGAG"
    rna_torsionbert_helper = RNATorsionBERTHelper()
    output_angles = rna_torsionbert_helper.predict(sequence)
    print(output_angles)
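
The model overview cites MCQ (mean of circular quantities) as the evaluation metric. The sketch below shows one way to compute an MCQ-style score between predicted and reference angles, assuming the commonly used definition: take the shortest angular distance for each pair, then combine the distances through the arctangent of their mean sine and mean cosine. The function name `mcq` and the toy arrays are made up for illustration, and skipping positions where either angle is NaN is a simplification rather than part of that definition.

```python
import numpy as np


def mcq(pred_deg, ref_deg) -> float:
    """MCQ-style score in degrees between two arrays of angles given in degrees."""
    pred = np.radians(np.asarray(pred_deg, dtype=float)).ravel()
    ref = np.radians(np.asarray(ref_deg, dtype=float)).ravel()
    # Simplification (assumption): ignore positions where either angle is undefined.
    keep = ~(np.isnan(pred) | np.isnan(ref))
    two_pi = 2 * np.pi
    diff = np.abs(np.mod(pred[keep], two_pi) - np.mod(ref[keep], two_pi))
    # Shortest distance on the circle, between 0 and pi radians.
    delta = np.minimum(diff, two_pi - diff)
    return float(np.degrees(np.arctan2(np.sin(delta).mean(), np.cos(delta).mean())))


# Toy values, not model outputs: rows are residues, columns are angles.
pred = np.array([[10.0, 170.0], [-60.0, np.nan]])
ref = np.array([[20.0, -170.0], [-60.0, 55.0]])
score = mcq(pred, ref)
print(round(score, 1))  # 10.0
```

Because the function flattens its inputs, it should also accept the DataFrame returned by `predict` together with a reference table of the same shape and column order.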