🚀 Isoformer
Isoformer is a multi-modal model that accurately predicts differential transcript expression and outperforms existing methods. The framework transfers knowledge from three pre-trained encoders: Enformer for the DNA modality, Nucleotide Transformer v2 for the RNA modality, and ESM2 for the protein modality.
🚀 Quick Start
✨ Features
- Accurately predict differential transcript expression.
- Leverage multiple modalities and transfer knowledge from pre-trained encoders.
💻 Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModelForMaskedLM
import numpy as np
import torch

# Load the tokenizer and the model (both rely on custom code from the Hub)
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/isoformer", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/isoformer", trust_remote_code=True)

# Toy protein and RNA inputs, plus a random DNA sequence of 196,608 nucleotides
protein_sequences = ["RSRSRSRSRSRSRSRSRSRSRL" * 9]
rna_sequences = ["ATTCCGGTTTTCA" * 9]
sequence_length = 196_608
rng = np.random.default_rng(seed=0)
dna_sequences = ["".join(rng.choice(list("ATCGN"), size=(sequence_length,)))]

# Tokenize the three modalities in a single call
torch_tokens = tokenizer(
    dna_input=dna_sequences, rna_input=rna_sequences, protein_input=protein_sequences
)
# The tokenizer output is indexed by modality: 0 = DNA, 1 = RNA, 2 = protein
dna_torch_tokens = torch.tensor(torch_tokens[0]["input_ids"])
rna_torch_tokens = torch.tensor(torch_tokens[1]["input_ids"])
protein_torch_tokens = torch.tensor(torch_tokens[2]["input_ids"])

# Run inference; the masks exclude positions whose token id is 1
# (presumably the padding token)
torch_output = model.forward(
    tensor_dna=dna_torch_tokens,
    tensor_rna=rna_torch_tokens,
    tensor_protein=protein_torch_tokens,
    attention_mask_rna=rna_torch_tokens != 1,
    attention_mask_protein=protein_torch_tokens != 1,
)

print(f"Gene expression predictions: {torch_output['gene_expression_predictions']}")
print(f"Final DNA embedding: {torch_output['final_dna_embeddings']}")
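Before tokenizing, it can help to sanity-check the DNA input. The following is a minimal sketch and not part of the Isoformer or transformers API: `check_dna_input` is a hypothetical helper, and the assumption that DNA inputs use the alphabet `ATCGN` with a length of 196,608 is inferred only from the example above.

```python
# Hypothetical helper, not part of the Isoformer or transformers API.
# Assumption: DNA inputs use the alphabet "ATCGN" and the length used in the
# example above (196_608); both are inferred from that example only.
DNA_ALPHABET = frozenset("ATCGN")
EXPECTED_DNA_LENGTH = 196_608


def check_dna_input(sequence: str, expected_length: int = EXPECTED_DNA_LENGTH) -> None:
    """Raise ValueError if the sequence has the wrong length or alphabet."""
    if len(sequence) != expected_length:
        raise ValueError(
            f"expected {expected_length} nucleotides, got {len(sequence)}"
        )
    unexpected = set(sequence) - DNA_ALPHABET
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")


# Usage: a valid toy sequence passes silently.
check_dna_input("ACGTN" * 2, expected_length=10)
```

Calling `check_dna_input(dna_sequences[0])` before the tokenizer call would catch a truncated or mis-encoded sequence early, rather than inside the model.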
📚 Documentation
📄 Information Table
| Property | Details |
|---|---|
| Tags | DNA, RNA, protein, biology, genomics |
| Datasets | InstaDeepAI/multi_omics_transcript_expression |
| Developed by | InstaDeep |
🔧 Technical Details
Isoformer is trained on RNA transcript expression data obtained from the GTEx portal. The data consists of transcript TPM (transcripts per million) measurements across 30 tissues from more than 5000 individuals. In total, the dataset contains ∼170k unique transcripts, 90k of which are protein-coding and correspond to ∼20k unique genes.
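For readers unfamiliar with the unit, TPM normalizes read counts by transcript length and then rescales them so that the values in one sample sum to one million. The sketch below illustrates that standard definition on invented numbers; `counts_to_tpm` is a hypothetical helper and is not taken from the Isoformer or GTEx pipelines.

```python
def counts_to_tpm(counts, lengths_kb):
    """Convert raw read counts to TPM given transcript lengths in kilobases."""
    # Reads per kilobase: normalize each count by its transcript length.
    rpk = [count / length for count, length in zip(counts, lengths_kb)]
    total = sum(rpk)
    if total == 0:
        # No reads at all: return zeros instead of dividing by zero.
        return [0.0 for _ in rpk]
    # Rescale so the values of one sample sum to one million.
    return [value / total * 1_000_000 for value in rpk]


# Toy example with invented numbers (not GTEx data).
tpm = counts_to_tpm(counts=[100, 300, 600], lengths_kb=[1.0, 3.0, 2.0])
print(tpm)
```

Because the values are rescaled per sample, TPMs are comparable across transcripts within a sample, which is what makes them a natural target for transcript-level expression prediction.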