Isoformer: An Open-Source Model for Predicting Differential Transcript Expression

Isoformer

Developed by InstaDeepAI

Isoformer is a model capable of accurately predicting differential transcript expression, outperforming existing methods and fully leveraging multimodal data.

Protein Model

Transformers

#Multimodal Transcript Prediction #Genomic Expression Analysis #Cross-modal Knowledge Transfer

Release Date: 5/13/2024

Model Overview

Isoformer is a model for predicting differential transcript expression, achieving efficient gene expression prediction by integrating data from three modalities: DNA, RNA, and protein.

Model Features

Multimodal Data Integration

Integrates data from three modalities—DNA, RNA, and protein—to enhance prediction accuracy.

Efficient Knowledge Transfer

Efficiently transfers knowledge from three pre-trained encoders: Enformer, Nucleotide Transformer v2, and ESM2.

High-performance Prediction

Outperforms existing methods in differential transcript expression prediction tasks.

Model Capabilities

Gene Expression Prediction

Multimodal Data Integration

Transcript Expression Analysis

Use Cases

Genomics Research

Differential Transcript Expression Prediction

Predicts transcript expression differences across tissues or conditions.

Superior prediction accuracy compared to existing methods.

🚀 Isoformer

Isoformer is a model that accurately predicts differential transcript expression. It outperforms existing methods and makes use of multiple modalities. Our framework transfers knowledge from three pre-trained encoders: Enformer for the DNA modality, Nucleotide Transformer v2 for the RNA modality, and ESM2 for the protein modality.

✨ Features

Accurately predicts differential transcript expression.
Leverages multiple modalities and transfers knowledge from pre-trained encoders.

💻 Usage Examples

Basic Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM
import numpy as np
import torch

# Load the tokenizer and the model (both rely on custom code hosted with the checkpoint)
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/isoformer", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/isoformer", trust_remote_code=True)

protein_sequences = ["RSRSRSRSRSRSRSRSRSRSRL" * 9]
rna_sequences = ["ATTCCGGTTTTCA" * 9]
sequence_length = 196_608
rng = np.random.default_rng(seed=0)
dna_sequences = ["".join(rng.choice(list("ATCGN"), size=(sequence_length,)))]

torch_tokens = tokenizer(
    dna_input=dna_sequences, rna_input=rna_sequences, protein_input=protein_sequences
)
dna_torch_tokens = torch.tensor(torch_tokens[0]["input_ids"])
rna_torch_tokens = torch.tensor(torch_tokens[1]["input_ids"])
protein_torch_tokens = torch.tensor(torch_tokens[2]["input_ids"])

torch_output = model.forward(
    tensor_dna=dna_torch_tokens,
    tensor_rna=rna_torch_tokens,
    tensor_protein=protein_torch_tokens,
    attention_mask_rna=rna_torch_tokens != 1,
    attention_mask_protein=protein_torch_tokens != 1,
)

print(f"Gene expression predictions: {torch_output['gene_expression_predictions']}")
print(f"Final DNA embedding: {torch_output['final_dna_embeddings']}")
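
The example above builds a random DNA sequence of exactly 196,608 characters. Real sequences rarely have that length, so a small helper can crop or pad them first. The sketch below is only an illustration: `fit_dna_window` is a hypothetical helper that is not part of the Isoformer repository, and padding with `N` is an assumption, suggested by the `"ATCGN"` alphabet used above rather than stated by the authors.

```python
DNA_WINDOW = 196_608  # same value as `sequence_length` in the usage example


def fit_dna_window(seq: str, length: int = DNA_WINDOW, pad_char: str = "N") -> str:
    """Center-crop or symmetrically pad `seq` so it is exactly `length` characters.

    Hypothetical helper: padding with "N" is an assumption, not documented behavior.
    """
    seq = seq.upper()
    unknown = set(seq) - set("ATCGN")
    if unknown:
        raise ValueError(f"Unexpected characters in DNA sequence: {sorted(unknown)}")
    if len(seq) >= length:
        # Too long: keep the central window.
        start = (len(seq) - length) // 2
        return seq[start:start + length]
    # Too short: split the padding between both ends (extra character goes right).
    missing = length - len(seq)
    left = missing // 2
    return pad_char * left + seq + pad_char * (missing - left)


short_window = fit_dna_window("ACGT" * 10)
long_window = fit_dna_window("ACGT" * 100_000)
print(len(short_window), len(long_window))
```

The resulting strings could then be passed as `dna_input` to the tokenizer in place of the random sequence, provided that centering the window on the region of interest suits the use case.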

📚 Documentation

📄 Information Table

Property	Details
Tags	DNA, RNA, protein, biology, genomics
Datasets	InstaDeepAI/multi_omics_transcript_expression
Developed by	InstaDeep

🔧 Technical Details

Isoformer is trained on RNA transcript expression data obtained from the GTEx portal. The data consist of transcript TPM (transcripts per million) measurements across 30 tissues from more than 5,000 individuals. In total, the dataset contains ∼170k unique transcripts, of which 90k are protein-coding and correspond to ∼20k unique genes.
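
For readers unfamiliar with TPM, the following sketch shows the standard definition and one common way to compare expression between two tissues. It is not the GTEx processing pipeline and not Isoformer's training code; the function names, the toy numbers, and the pseudocount of 1 are all assumptions made for illustration.

```python
import numpy as np


def counts_to_tpm(counts: np.ndarray, lengths_bp: np.ndarray) -> np.ndarray:
    """Convert read counts (transcripts x samples) to TPM.

    Standard definition: length-normalize the counts, then rescale each
    sample (column) so that its values sum to one million.
    """
    rate = counts / (lengths_bp[:, None] / 1_000.0)  # reads per kilobase
    return rate / rate.sum(axis=0, keepdims=True) * 1e6


def log2_fold_change(tpm_a: np.ndarray, tpm_b: np.ndarray, pseudocount: float = 1.0) -> np.ndarray:
    """log2 ratio of condition B over condition A; the pseudocount avoids log(0)."""
    return np.log2((tpm_b + pseudocount) / (tpm_a + pseudocount))


# Toy data: 3 transcripts measured in 2 tissues (made-up numbers).
counts = np.array([[100.0, 200.0], [300.0, 100.0], [600.0, 700.0]])
lengths_bp = np.array([1_000.0, 2_000.0, 3_000.0])

tpm = counts_to_tpm(counts, lengths_bp)
lfc = log2_fold_change(tpm[:, 0], tpm[:, 1])
print(tpm.sum(axis=0))  # each column sums to one million (up to rounding)
print(lfc)
```

A positive log fold change means the transcript is relatively more abundant in the second tissue. How expression values are transformed before training is not described in this document.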
