🚀 Geneformer
Geneformer is a foundational transformer model pretrained on a large-scale corpus of single-cell transcriptomes, enabling context-aware predictions in network biology with limited data.
🚀 Quick Start
- For details of the original model trained on ~30 million transcriptomes in June 2021 and the initial report of in silico perturbation and cell and gene classification strategies, see our manuscript.
- For details of the expanded model trained on ~95 million transcriptomes in April 2024 and our continual learning, multitask learning, and quantization strategies, see our manuscript.
- For documentation, see geneformer.readthedocs.io.
✨ Features
Model Description
Geneformer is a foundational transformer model pretrained on a large-scale corpus of single-cell transcriptomes representing a broad range of human tissues. It was initially pretrained in June 2021 on [Genecorpus-30M](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M), a corpus of ~30 million single-cell transcriptomes. Cells with high mutational burdens (e.g., malignant cells and immortalized cell lines) were excluded to facilitate interpretation without companion genome sequencing. In April 2024, the model was pretrained on ~95 million non-cancer transcriptomes, followed by continual learning on ~14 million cancer transcriptomes to yield a cancer domain-tuned model.
Each single cell's transcriptome is presented to the model as a rank value encoding: genes are ranked by their expression in that cell, scaled by their expression across the entire Genecorpus-30M. This encoding provides a non-parametric representation of the cell's transcriptome, deprioritizing ubiquitously highly expressed housekeeping genes and boosting the rank of genes, such as transcription factors, that distinguish cell state. The rank-based approach is also more robust to technical artifacts that may bias absolute transcript counts.
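To make the encoding concrete, here is a minimal sketch; the function name, the inputs, and normalization by corpus-wide expression values are illustrative assumptions, not the repository's exact implementation:

```python
import numpy as np

# Hypothetical sketch of rank value encoding, not the repository's exact code.
def rank_value_encode(cell_counts: np.ndarray,
                      corpus_expression: np.ndarray,
                      gene_ids: np.ndarray) -> np.ndarray:
    # Scale each gene's count in this cell by its corpus-wide expression,
    # which deprioritizes ubiquitously highly expressed housekeeping genes.
    normalized = cell_counts / corpus_expression
    expressed = normalized > 0                  # keep only detected genes
    # Order detected genes by normalized expression, highest first; the
    # resulting gene IDs form the token sequence presented to the model.
    order = np.argsort(-normalized[expressed])
    return gene_ids[expressed][order]
```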
The rank value encoding of each single cell's transcriptome then passes through N layers of transformer encoder units, where N depends on the model size. Pretraining used a masked learning objective: 15% of the genes in each transcriptome were masked, and the model was trained to predict each masked gene from the context of the unmasked genes. Because this objective is entirely self-supervised, it can exploit unlabeled data, enabling training at large scale.
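As an illustration of the objective's input corruption step, a minimal masking helper might look like the following (the helper name and mask token ID are assumptions for illustration):

```python
import numpy as np

# Hypothetical sketch of the masked learning objective's input corruption:
# ~15% of gene tokens are hidden, and the model must predict them.
def mask_gene_tokens(tokens: np.ndarray, mask_token_id: int,
                     mask_prob: float = 0.15,
                     rng: np.random.Generator | None = None):
    rng = rng or np.random.default_rng()
    is_masked = rng.random(tokens.shape) < mask_prob      # ~15% of positions
    inputs = np.where(is_masked, mask_token_id, tokens)   # genes hidden
    labels = np.where(is_masked, tokens, -100)            # -100: ignored by loss
    return inputs, labels
```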
We detail applications and results in our manuscript. During pretraining, Geneformer acquired an understanding of network dynamics, encoding network hierarchy in its attention weights. With both zero-shot learning and fine-tuning on limited data, it improved predictive accuracy across downstream tasks relevant to chromatin and network dynamics. For example, in silico perturbation identified a novel transcription factor in cardiomyocytes, and in silico treatment analysis identified candidate therapeutic targets for cardiomyopathy.
The repository includes the following pretrained models:
(L = layers; M = millions of cells used for pretraining; i = input size; date in parentheses = pretraining date)
- GF-6L-30M-i2048 (June 2021)
- GF-12L-30M-i2048 (June 2021)
- GF-12L-95M-i4096 (April 2024)
- GF-20L-95M-i4096 (April 2024)
The current default model in the main directory of the repository is GF-12L-95M-i4096. The repository also provides fine-tuned models in the fine_tuned_models directory, including the cancer-tuned model GF-12L-95M-i4096_CLcancer.
Application
The pretrained Geneformer model can be used for zero-shot learning (e.g., in silico perturbation analysis) or fine-tuned for downstream tasks (e.g., gene or cell state classification); a minimal fine-tuning sketch follows the lists below.
Example applications demonstrated in our manuscript include:
Fine-tuning
- Transcription factor dosage sensitivity
- Chromatin dynamics (bivalently marked promoters)
- Transcription factor regulatory range
- Gene network centrality
- Transcription factor targets
- Cell type annotation
- Batch integration
- Cell state classification across differentiation
- Disease classification
- In silico perturbation to determine disease-driving genes
- In silico treatment to determine candidate therapeutic targets
Zero-shot learning
- Batch integration
- Gene context specificity
- In silico reprogramming
- In silico differentiation
- In silico perturbation to determine impact on cell state
- In silico perturbation to determine transcription factor targets
- In silico perturbation to determine transcription factor cooperativity
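Since Geneformer is a BERT-style model, one concrete fine-tuning route follows the standard Hugging Face Transformers pattern. This sketch assumes a tokenized dataset on disk with `input_ids` and `label` columns and uses placeholder paths and hyperparameters, not recommendations from the manuscript:

```python
from datasets import load_from_disk
from transformers import (BertForSequenceClassification, Trainer,
                          TrainingArguments)

# Tokenized cells with "input_ids" and "label" columns (path is assumed).
dataset = load_from_disk("data/tokenized/my_dataset.dataset")

# Swap the pretraining head for a fresh classification head; this loads
# the default model from the repository root.
model = BertForSequenceClassification.from_pretrained(
    "ctheodoris/Geneformer", num_labels=3)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned_geneformer",
                           num_train_epochs=1,
                           per_device_train_batch_size=12),
    train_dataset=dataset,
    # Note: a data collator that pads variable-length rank value encodings
    # may also be required, depending on how the dataset was tokenized.
)
trainer.train()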
📦 Installation
Besides the pretrained model, this repository has functions for tokenizing and collating single-cell transcriptomics data, pretraining, fine-tuning, extracting and plotting cell embeddings, and in silico perturbation. To install (~20s):
```bash
git lfs install
git clone https://huggingface.co/ctheodoris/Geneformer
cd Geneformer
pip install .
```
For usage, see examples for the following (a brief tokenization sketch follows this list):
- Tokenizing transcriptomes
- Pretraining
- Hyperparameter tuning
- Fine-tuning
- Extracting and plotting cell embeddings
- In silico perturbation
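For instance, a minimal tokenization run might look like the sketch below; the argument names follow the repository's examples and may differ between versions, and the paths are placeholders:

```python
from geneformer import TranscriptomeTokenizer

# Map input-file metadata columns to attribute names kept in the output.
tk = TranscriptomeTokenizer({"cell_type": "cell_type"}, nproc=4)

tk.tokenize_data("data/raw",        # directory of input .loom files
                 "data/tokenized",  # output directory
                 "my_dataset",      # output file prefix
                 file_format="loom")
```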
⚠️ Important Note
The fine-tuning examples are generally applicable; the input datasets and labels depend on the downstream task. Example input files for some downstream tasks are in the [example_input_files directory](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files) of the dataset repository, but they represent only a small subset of possible fine-tuning applications.
GPU resources are required for efficient usage of Geneformer. We also strongly recommend tuning hyperparameters (e.g., maximum learning rate, learning schedule, number of layers to freeze) for each downstream fine-tuning application, as this can significantly boost predictive potential in the downstream task; a sketch of one way to run such a search follows.
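A hedged sketch of a hyperparameter search using the Hugging Face Trainer with an Optuna backend; the search ranges are illustrative only, and `train_ds`/`eval_ds` stand in for your own tokenized dataset splits:

```python
from transformers import (BertForSequenceClassification, Trainer,
                          TrainingArguments)

def model_init():
    # A fresh classification head per trial, on pretrained Geneformer weights.
    return BertForSequenceClassification.from_pretrained(
        "ctheodoris/Geneformer", num_labels=3)

def hp_space(trial):
    # Illustrative search ranges only; tune per application.
    return {
        "learning_rate": trial.suggest_float(
            "learning_rate", 1e-5, 1e-3, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 3),
        "warmup_steps": trial.suggest_int("warmup_steps", 100, 2000),
    }

trainer = Trainer(model_init=model_init,
                  args=TrainingArguments(output_dir="hp_search"),
                  train_dataset=train_ds,   # your tokenized training split
                  eval_dataset=eval_ds)     # your tokenized validation split
best_run = trainer.hyperparameter_search(direction="minimize",
                                         hp_space=hp_space,
                                         backend="optuna",
                                         n_trials=10)
```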
📄 License
The license for this project is Apache-2.0.
📚 Documentation
Citations
- C V Theodoris#, L Xiao, A Chopra, M D Chaffin, Z R Al Sayed, M C Hill, H Mantineo, E Brydon, Z Zeng, X S Liu, P T Ellinor#. Transfer learning enables predictions in network biology. Nature, 31 May 2023. (#co-corresponding authors)
- H Chen*, M S Venkatesh*, J Gomez Ortega, S V Mahesh, T Nandi, R Madduri, K Pelka†, C V Theodoris†#. Quantized multi-task learning for context-specific representations of gene network dynamics. bioRxiv, 19 Aug 2024. (*co-first authors, †co-senior authors, #corresponding author)
Information Table
| Property | Details |
|---|---|
| Model Type | Geneformer, a foundational transformer model |
| Training Data | Initially ~30 million single-cell transcriptomes (Genecorpus-30M, June 2021); expanded to ~95 million non-cancer transcriptomes (April 2024), followed by continual learning on ~14 million cancer transcriptomes |