🚀 Geneformer
Geneformer is a foundational transformer model pretrained on a large-scale corpus of single-cell transcriptomes, enabling context-aware predictions in network biology with limited data.
🚀 Quick Start
- For details of the original model trained on ~30 million transcriptomes in June 2021 and the initial report of in silico perturbation and cell and gene classification strategies, see our manuscript.
- For details of the expanded model trained on ~95 million transcriptomes in April 2024 and our continual learning, multitask learning, and quantization strategies, see our manuscript.
- For documentation, see geneformer.readthedocs.io.
✨ Features
Model Description
Geneformer is a foundational transformer model pretrained on a large-scale corpus of single-cell transcriptomes representing a broad range of human tissues. It was initially pretrained in June 2021 on [Genecorpus-30M](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M), a corpus of ~30 million single-cell transcriptomes. Cells with high mutational burdens (e.g., malignant cells and immortalized cell lines) were excluded to facilitate interpretation without companion genome sequencing. In April 2024, the model was pretrained on ~95 million non-cancer transcriptomes, followed by continual learning on ~14 million cancer transcriptomes to yield a cancer domain-tuned model.
Each single cell's transcriptome is presented to the model as a rank value encoding: genes are ranked by their expression in that cell, scaled by their expression across the entire Genecorpus-30M. This encoding provides a non-parametric representation of the cell's transcriptome, deprioritizing ubiquitously highly expressed housekeeping genes and boosting the rank of genes, such as transcription factors, that distinguish cell state. The rank-based approach is also more robust to technical artifacts that may bias absolute transcript counts.
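To make the encoding concrete, here is a minimal sketch; the function name, the inputs, and normalization by corpus-wide expression values are illustrative assumptions, not the repository's exact implementation:

```python
import numpy as np

# Hypothetical sketch of rank value encoding, not the repository's exact code.
def rank_value_encode(cell_counts: np.ndarray,
                      corpus_expression: np.ndarray,
                      gene_ids: np.ndarray) -> np.ndarray:
    # Scale each gene's count in this cell by its corpus-wide expression,
    # which deprioritizes ubiquitously highly expressed housekeeping genes.
    normalized = cell_counts / corpus_expression
    expressed = normalized > 0                  # keep only detected genes
    # Order detected genes by normalized expression, highest first; the
    # resulting gene IDs form the token sequence presented to the model.
    order = np.argsort(-normalized[expressed])
    return gene_ids[expressed][order]
```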
The rank value encoding of each single cell's transcriptome then passes through N layers of transformer encoder units, where N depends on the model size. Pretraining used a masked learning objective: 15% of the genes in each transcriptome were masked, and the model was trained to predict each masked gene from the context of the unmasked genes. Because this objective is entirely self-supervised, it can exploit unlabeled data, enabling training at large scale.
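As an illustration of the objective's input corruption step, a minimal masking helper might look like the following (the helper name and mask token ID are assumptions for illustration):

```python
import numpy as np

# Hypothetical sketch of the masked learning objective's input corruption:
# ~15% of gene tokens are hidden, and the model must predict them.
def mask_gene_tokens(tokens: np.ndarray, mask_token_id: int,
                     mask_prob: float = 0.15,
                     rng: np.random.Generator | None = None):
    rng = rng or np.random.default_rng()
    is_masked = rng.random(tokens.shape) < mask_prob      # ~15% of positions
    inputs = np.where(is_masked, mask_token_id, tokens)   # genes hidden
    labels = np.where(is_masked, tokens, -100)            # -100: ignored by loss
    return inputs, labels
```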
We detail applications and results in our manuscript. During pretraining, Geneformer acquired an understanding of network dynamics, encoding network hierarchy in its attention weights. With both zero-shot learning and fine-tuning on limited data, it improved predictive accuracy across downstream tasks relevant to chromatin and network dynamics. For example, in silico perturbation identified a novel transcription factor in cardiomyocytes, and in silico treatment analysis identified candidate therapeutic targets for cardiomyopathy.
The repository includes the following pretrained models:
(L = layers; M = millions of cells used for pretraining; i = input size; date in parentheses = pretraining date)
- GF-6L-30M-i2048 (June 2021)
- GF-12L-30M-i2048 (June 2021)
- GF-12L-95M-i4096 (April 2024)
- GF-20L-95M-i4096 (April 2024)
The current default model in the main directory of the repository is GF-12L-95M-i4096. The repository also provides fine-tuned models in the fine_tuned_models directory, including the cancer-tuned model GF-12L-95M-i4096_CLcancer.
Application
The pretrained Geneformer model can be used for zero-shot learning (e.g., in silico perturbation analysis) or fine-tuned for downstream tasks (e.g., gene or cell state classification); a minimal fine-tuning sketch follows the lists below.
Example applications demonstrated in our manuscript include:
Fine-tuning
- Transcription factor dosage sensitivity
- Chromatin dynamics (bivalently marked promoters)
- Transcription factor regulatory range
- Gene network centrality
- Transcription factor targets
- Cell type annotation
- Batch integration
- Cell state classification across differentiation
- Disease classification
- In silico perturbation to determine disease-driving genes
- In silico treatment to determine candidate therapeutic targets
Zero-shot learning
- Batch integration
- Gene context specificity
- In silico reprogramming
- In silico differentiation
- In silico perturbation to determine impact on cell state
- In silico perturbation to determine transcription factor targets
- In silico perturbation to determine transcription factor cooperativity
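Since Geneformer is a BERT-style model, one concrete fine-tuning route follows the standard Hugging Face Transformers pattern. This sketch assumes a tokenized dataset on disk with `input_ids` and `label` columns and uses placeholder paths and hyperparameters, not recommendations from the manuscript:

```python
from datasets import load_from_disk
from transformers import (BertForSequenceClassification, Trainer,
                          TrainingArguments)

# Tokenized cells with "input_ids" and "label" columns (path is assumed).
dataset = load_from_disk("data/tokenized/my_dataset.dataset")

# Swap the pretraining head for a fresh classification head; this loads
# the default model from the repository root.
model = BertForSequenceClassification.from_pretrained(
    "ctheodoris/Geneformer", num_labels=3)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned_geneformer",
                           num_train_epochs=1,
                           per_device_train_batch_size=12),
    train_dataset=dataset,
    # Note: a data collator that pads variable-length rank value encodings
    # may also be required, depending on how the dataset was tokenized.
)
trainer.train()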
📦 Installation
Besides the pretrained model, this repository has functions for tokenizing and collating single-cell transcriptomics data, pretraining, fine-tuning, extracting and plotting cell embeddings, and in silico perturbation. To install (~20s):
```bash
git lfs install
git clone https://huggingface.co/ctheodoris/Geneformer
cd Geneformer
pip install .
```
For usage, see examples for the following (a brief tokenization sketch follows this list):
- Tokenizing transcriptomes
- Pretraining
- Hyperparameter tuning
- Fine-tuning
- Extracting and plotting cell embeddings
- In silico perturbation
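For instance, a minimal tokenization run might look like the sketch below; the argument names follow the repository's examples and may differ between versions, and the paths are placeholders:

```python
from geneformer import TranscriptomeTokenizer

# Map input-file metadata columns to attribute names kept in the output.
tk = TranscriptomeTokenizer({"cell_type": "cell_type"}, nproc=4)

tk.tokenize_data("data/raw",        # directory of input .loom files
                 "data/tokenized",  # output directory
                 "my_dataset",      # output file prefix
                 file_format="loom")
```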
⚠️ Important Note
The fine-tuning examples are generally applicable; the input datasets and labels depend on the downstream task. Example input files for some downstream tasks are in the [example_input_files directory](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files) of the dataset repository, but they represent only a small subset of possible fine-tuning applications.
GPU resources are required for efficient usage of Geneformer. We also strongly recommend tuning hyperparameters (e.g., maximum learning rate, learning schedule, number of layers to freeze) for each downstream fine-tuning application, as this can significantly boost predictive potential in the downstream task; a sketch of one way to run such a search follows.
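A hedged sketch of a hyperparameter search using the Hugging Face Trainer with an Optuna backend; the search ranges are illustrative only, and `train_ds`/`eval_ds` stand in for your own tokenized dataset splits:

```python
from transformers import (BertForSequenceClassification, Trainer,
                          TrainingArguments)

def model_init():
    # A fresh classification head per trial, on pretrained Geneformer weights.
    return BertForSequenceClassification.from_pretrained(
        "ctheodoris/Geneformer", num_labels=3)

def hp_space(trial):
    # Illustrative search ranges only; tune per application.
    return {
        "learning_rate": trial.suggest_float(
            "learning_rate", 1e-5, 1e-3, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 3),
        "warmup_steps": trial.suggest_int("warmup_steps", 100, 2000),
    }

trainer = Trainer(model_init=model_init,
                  args=TrainingArguments(output_dir="hp_search"),
                  train_dataset=train_ds,   # your tokenized training split
                  eval_dataset=eval_ds)     # your tokenized validation split
best_run = trainer.hyperparameter_search(direction="minimize",
                                         hp_space=hp_space,
                                         backend="optuna",
                                         n_trials=10)
```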
📄 License
The license for this project is Apache-2.0.
📚 Documentation
Citations
- C V Theodoris#, L Xiao, A Chopra, M D Chaffin, Z R Al Sayed, M C Hill, H Mantineo, E Brydon, Z Zeng, X S Liu, P T Ellinor#. Transfer learning enables predictions in network biology. Nature, 31 May 2023. (#co-corresponding authors)
- H Chen*, M S Venkatesh*, J Gomez Ortega, S V Mahesh, T Nandi, R Madduri, K Pelka†, C V Theodoris†#. Quantized multi-task learning for context-specific representations of gene network dynamics. bioRxiv, 19 Aug 2024. (*co-first authors, †co-senior authors, #corresponding author)
Information Table
| Property | Details |
|---|---|
| Model Type | Geneformer, a foundational transformer model |
| Training Data | Initially ~30 million single-cell transcriptomes (Genecorpus-30M, June 2021); expanded to ~95 million non-cancer transcriptomes (April 2024), followed by continual learning on ~14 million cancer transcriptomes |