🚀 Geneformer
Geneformer is a foundational transformer model pretrained on a large-scale corpus of single-cell transcriptomes, enabling context-aware predictions in network biology settings with limited data.
📚 Documentation
Abstract
Mapping gene networks requires large amounts of transcriptomic data to learn the connections between genes, which impedes discoveries in settings with limited data, including rare diseases and diseases affecting clinically inaccessible tissues. Recently, transfer learning has revolutionized fields such as natural language understanding and computer vision by leveraging deep learning models pretrained on large-scale general datasets that can then be fine-tuned towards a vast array of downstream tasks with limited task-specific data. Here, we developed a context-aware, attention-based deep learning model, Geneformer, pretrained on a large-scale corpus of about 30 million single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology. During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the model's attention weights in a completely self-supervised manner. Fine-tuning towards a diverse panel of downstream tasks relevant to chromatin and network dynamics using limited task-specific data demonstrated that Geneformer consistently boosted predictive accuracy. Applied to disease modelling with limited patient data, Geneformer identified candidate therapeutic targets for cardiomyopathy. Overall, Geneformer represents a pretrained deep learning model from which fine-tuning towards a broad range of downstream applications can be pursued to accelerate discovery of key network regulators and candidate therapeutic targets.
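Since fine-tuning with limited task-specific data is the central claim above, a minimal sketch may help. The example below loads the Geneformer checkpoint from its Hugging Face repo with the standard transformers API; the task (a 3-class cell-state classifier), the cells and labels variables, and all hyperparameters are hypothetical placeholders, and tokenized cells are assumed to be padded to a common length. This is an illustrative sketch, not the paper's exact fine-tuning protocol.
import torch
from torch.utils.data import Dataset
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

class CellDataset(Dataset):
    """Pairs tokenized cells (equal-length lists of gene-token ids) with labels."""
    def __init__(self, cells, labels):  # both are hypothetical user-provided inputs
        self.cells, self.labels = cells, labels
    def __len__(self):
        return len(self.cells)
    def __getitem__(self, i):
        ids = torch.tensor(self.cells[i])
        return {"input_ids": ids,
                "attention_mask": (ids != 0).long(),  # pad token id is 0
                "labels": torch.tensor(self.labels[i])}

# attach a fresh classification head to the pretrained trunk
model = BertForSequenceClassification.from_pretrained(
    "ctheodoris/Geneformer", num_labels=3)  # 3 classes: placeholder

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="geneformer_finetune",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=CellDataset(cells, labels),
)
trainer.train()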
💻 Usage Examples
Basic Usage
from tdc.model_server.tokenizers.geneformer import GeneformerTokenizer
from tdc import tdc_hf_interface
import torch

# adata is a user-provided AnnData object of single-cell expression data
tokenizer = GeneformerTokenizer()
cells, _ = tokenizer.tokenize_cell_vectors(adata,
                                           ensembl_id="feature_id",
                                           ncounts="n_measured_vars")
input_tensor = torch.tensor(cells)

# load the pretrained Geneformer model via the TDC Hugging Face interface
geneformer = tdc_hf_interface("Geneformer")
model = geneformer.load()

# mask out padding positions (pad token id is 0)
attention_mask = (input_tensor != 0).long()

outputs = model(input_tensor,
                attention_mask=attention_mask,
                output_hidden_states=True)

# token-level embeddings from the final hidden layer
embs = outputs.hidden_states[-1]
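The hidden states returned above are token-level (one embedding per gene rank per cell). To obtain a single embedding per cell, one common convention is to mean-pool over non-padding positions. The sketch below continues from the variables defined above; it is an illustrative pooling convention, not a TDC-specific API.
# average token embeddings over non-padding positions to get one
# fixed-size vector per cell (illustrative mean-pooling convention)
mask = attention_mask.unsqueeze(-1).float()             # (cells, tokens, 1)
cell_embs = (embs * mask).sum(dim=1) / mask.sum(dim=1)  # (cells, hidden_size)
The resulting cell_embs can then be used for downstream analyses such as clustering or training a classifier.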
📄 License
The model uses the Apache-2.0 license.
📚 Citations
TDC Citation
@inproceedings{velez-arce2024signals,
  title={Signals in the Cells: Multimodal and Contextualized Machine Learning Foundations for Therapeutics},
  author={Alejandro Velez-Arce and Xiang Lin and Kexin Huang and Michelle M Li and Wenhao Gao and Bradley Pentelute and Tianfan Fu and Manolis Kellis and Marinka Zitnik},
  booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
  year={2024},
  url={https://openreview.net/forum?id=kL8dlYp6IM}
}
Additional Citations
- C V Theodoris#, L Xiao, A Chopra, M D Chaffin, Z R Al Sayed, M C Hill, H Mantineo, E Brydon, Z Zeng, X S Liu, P T Ellinor#. Transfer learning enables predictions in network biology. Nature, 31 May 2023. (#co-corresponding authors)
- H Chen*, M S Venkatesh*, J Gomez Ortega, S V Mahesh, T Nandi, R Madduri, K Pelka†, C V Theodoris†#. Quantized multi-task learning for context-specific representations of gene network dynamics. bioRxiv, 19 Aug 2024. (*co-first authors, †co-senior authors, #corresponding author)
🔗 Model HF Homepage
https://huggingface.co/ctheodoris/Geneformer
💡 Usage Tip
We use the 20L-95M-i4096 release of Geneformer on TDC, which was trained on the 95M version of Genecorpus.