This repository contains the code that accompanies our paper, "Consequences of training data composition for deep generative models in single-cell biology". The preprint is available on bioRxiv at https://www.biorxiv.org/content/early/2025/02/24/2025.02.19.639127.
Foundation models for single-cell transcriptomics have the potential to augment (or replace) purpose-built tools for a variety of common analyses, especially when data are sparse. In the field of large language models, training data composition greatly shapes performance; however, to date, single-cell foundation models have largely ignored this detail, opting instead to train on the largest possible corpus. Focusing on human hematopoiesis, we trained and analyzed deep generative models with various datasets, including cells from adult and developing tissues, disease states, and perturbation atlases. We find that (1) these models generalize poorly to unseen cell types, (2) adding malignant cells to a healthy cell training corpus does not necessarily improve modeling of unseen malignant cells, and (3) including an embryonic stem cell transcription factor differentiation atlas during training improves performance on out-of-distribution tasks. Our findings emphasize the importance of diverse training data and suggest strategies to optimize future single-cell foundation models.
For LDVAE analyses, you can recreate the necessary conda environment using `scvi-env-3.txt`.
Using the submodules provided, install Geneformer:
```
git lfs install
git clone https://github.com/lcrawlab/Geneformer.git
cd Geneformer
pip install .
```
For zero-shot Geneformer evaluations, install:
```
git clone https://github.com/microsoft/zero-shot-scfoundation
cd zero-shot-scfoundation/sc_foundation_evals
pip install .
```
Scripts to reproduce our analyses are found in three folders:
- `Preprocess` contains scripts to wrangle and quality control (QC) downloaded data.
- `Train` contains scripts to train and fine-tune LDVAE and Geneformer models. Each model has its own subdirectory.
- `Evaluation` contains scripts to compute reconstruction accuracies for LDVAE models, as well as scripts for evaluating Geneformer models in a zero-shot setting. Each model has its own subdirectory.
The `Preprocess` folder contains three scripts:
- `preprocess_data_bloodbase.py` generates training datasets for the Blood Baseline set of experiments visualized in Figure 2 and Figure S2. It takes as input a random seed that is used for subsetting datasets (see the sketch after this list).
- `preprocess_data_allbase.py` generates training datasets for the Atlas Baseline set of experiments visualized in Figure S4 and Figure S6. It likewise takes a random seed as input.
- `preprocess_eval_data.py` generates evaluation datasets.
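For orientation, seed-controlled subsetting of an AnnData object can look like the minimal sketch below. This is illustrative only: the file names, fraction, and helper function are placeholders, not the scripts' actual logic.

```python
# Minimal sketch of seed-controlled subsetting (illustrative; the path,
# fraction, and helper name are placeholders, not the scripts' actual logic).
import numpy as np
import anndata as ad

def subset_adata(adata: ad.AnnData, fraction: float, seed: int) -> ad.AnnData:
    """Return a reproducible random subset of cells."""
    rng = np.random.default_rng(seed)
    n_keep = int(fraction * adata.n_obs)
    idx = rng.choice(adata.n_obs, size=n_keep, replace=False)
    return adata[np.sort(idx)].copy()

adata = ad.read_h5ad("downloaded_dataset.h5ad")  # placeholder path
train_subset = subset_adata(adata, fraction=0.1, seed=42)
train_subset.write_h5ad("training_subset_seed42.h5ad")
```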
These scripts require users to download several publicly available datasets, which are described below.
To build the Blood or Atlas (baseline) models, the relevant scTab data can be downloaded by following the scTab repository's instructions under the README header "Download and concatenate scTab". For these datasets, we downloaded 10% of the scTab resource; distinct subsets were used for training and evaluation. The same instructions apply to the BoneMarrow dataset, but because scTab contains relatively few bone marrow cells, we downloaded 100% of the resource prior to subsetting. Again, distinct subsets were used for training and evaluation.
The hematopoietic malignancy data can be downloaded from the Curated Cancer Cell Atlas (3CA). The studies we retained are listed in the Methods section of the main manuscript. The script we used to generate the file `cca_Hematologic_aggregated.h5ad` from the downloaded files is `data_wrangling_scripts/cca_wrangle.ipynb`. The Ji et al. (2020) squamous cell carcinoma (SCC) evaluation dataset can be downloaded here.
The Perturb-seq datasets, in the form of MEX files, can be downloaded using GEO accession GSE264667 for the Jurkat experiment and here for the K562 data. The script we used to generate the files `K562_essential_raw_singlecell_01_mex_collated.h5ad` and `GSE264667_jurkat_raw_singlecell_01_mex_collated.h5ad` is `data_wrangling_scripts/collate_weissman_MEX.ipynb`.
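The collation itself happens in that notebook. Schematically (the directory layout below is an assumption, and the notebook may apply additional processing), stitching a set of MEX directories into a single `.h5ad` can be done with scanpy and anndata:

```python
# Schematic MEX collation (placeholder paths; collate_weissman_MEX.ipynb is
# the authoritative version and may apply additional processing).
from pathlib import Path
import anndata as ad
import scanpy as sc

mex_dirs = sorted(Path("K562_mex/").iterdir())  # one MEX directory per sample
adatas = {d.name: sc.read_10x_mtx(d) for d in mex_dirs if d.is_dir()}
collated = ad.concat(adatas, label="sample", index_unique="-")
collated.write_h5ad("K562_essential_raw_singlecell_01_mex_collated.h5ad")
```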
For the TFAtlas dataset, the file `GSE217460_210322_TFAtlas_subsample_raw.h5ad` can be downloaded directly from GEO accession GSE217460. Preprocessing this dataset also makes use of the publicly available GTEx file `GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct`, which can be found here.
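The GTEx file is in GCT format, which places two header lines (the GCT version and the matrix dimensions) before the actual table, so it can be loaded, for example, with pandas by skipping those lines:

```python
# Reading the GTEx median-TPM GCT file; the first two lines hold the GCT
# version ("#1.2") and matrix dimensions, not data.
import pandas as pd

gtex = pd.read_csv(
    "GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct",
    sep="\t",
    skiprows=2,
)
print(gtex.columns[:3].tolist())  # gene Name, Description, then tissue columns
```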
The Human Brain Cell Atlas neuron dataset can be downloaded here. The script we used to generate the file `Neurons_H18.30.002_10Ksubset.h5ad` is `data_wrangling_scripts/explore_subset_SilettiNeuron.ipynb`.
The `Train` folder contains scripts to train LDVAE and Geneformer models.
In the `ldvae` subfolder:
- The `Train_Models.py` script trains Blood- and Atlas-baseline LDVAE models using the scvi-tools package. It takes a random seed as input and outputs trained models as well as training curves. For details on the training parameters and model architecture, please see the Methods section of the manuscript (a hedged usage sketch follows).
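For orientation, a minimal LDVAE training run with scvi-tools looks roughly like the sketch below (`LinearSCVI` is scvi-tools' linearly decoded VAE). The paths and hyperparameters are placeholders, not the values used in the paper; see `Train_Models.py` and the Methods section for those.

```python
# Hedged sketch of LDVAE training with scvi-tools (paths and hyperparameters
# are placeholders; see Train_Models.py and the Methods for the real ones).
import scvi
import anndata as ad

scvi.settings.seed = 42  # analogous to the random-seed input of Train_Models.py

adata = ad.read_h5ad("training_subset_seed42.h5ad")  # placeholder path
scvi.model.LinearSCVI.setup_anndata(adata)
model = scvi.model.LinearSCVI(adata, n_latent=10)  # placeholder latent size
model.train(max_epochs=250)
model.save("ldvae_model/", overwrite=True)

# Training curves can be recovered from the recorded history.
elbo_train = model.history["elbo_train"]
```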
In the `geneformer` subfolder:
- The `pretrain_geneformer.py` script is used to pre-train new Geneformer models. Before pre-training a Geneformer model, the test/train/validation splits of the data must be tokenized using `tokenize_data.py` (a tokenization sketch follows).
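As a rough sketch (not the exact invocation in `tokenize_data.py`; the directories and metadata column are assumptions), Geneformer's `TranscriptomeTokenizer` converts `.h5ad` files into the tokenized datasets that the pre-training script consumes:

```python
# Hedged tokenization sketch with Geneformer's TranscriptomeTokenizer
# (directories and the metadata mapping are placeholders; see
# tokenize_data.py for the actual invocation). Input .h5ad files are
# expected to carry Ensembl gene IDs in .var and total counts in .obs.
from geneformer import TranscriptomeTokenizer

tokenizer = TranscriptomeTokenizer(
    custom_attr_name_dict={"cell_type": "cell_type"},  # assumed obs column
    nproc=4,
)
tokenizer.tokenize_data(
    "train_split_h5ads/",   # directory of input .h5ad files (placeholder)
    "tokenized_datasets/",  # output directory (placeholder)
    "train",                # output prefix (placeholder)
    file_format="h5ad",
)
```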
The `Evaluation` folder contains two subfolders for evaluating LDVAE and Geneformer models.
The `ldvae` subfolder contains two scripts:
- `LDVAE_eval.py` estimates reconstruction accuracies for all model/evaluation combinations (an illustrative sketch follows this list).
- `LDVAE_eval_class.py` defines a Python class containing a method for estimating reconstruction accuracy. It also contains utilities to (1) create a sample input/reconstruction scatterplot, (2) obtain the latent representation of a dataset from a particular model, and (3) compute expression reconstruction residuals.
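The metric details live in `LDVAE_eval_class.py`. Purely as an illustration of the idea (this is not the paper's exact definition of reconstruction accuracy), one might compare observed counts against the model's reconstruction like so:

```python
# Illustrative reconstruction check for a trained LDVAE (not the paper's
# exact metric; see LDVAE_eval_class.py for the real definition).
import numpy as np
import anndata as ad
import scvi
from scipy.stats import pearsonr

adata = ad.read_h5ad("evaluation_dataset.h5ad")  # placeholder path
model = scvi.model.LinearSCVI.load("ldvae_model/", adata=adata)

# Normalized expression scaled by the model's library-size estimate gives a
# reconstruction on roughly the scale of the observed counts.
recon = model.get_normalized_expression(adata, library_size="latent")

obs = adata.X.toarray() if hasattr(adata.X, "toarray") else np.asarray(adata.X)
r, _ = pearsonr(obs.ravel(), recon.to_numpy().ravel())
print(f"Pearson r between observed and reconstructed expression: {r:.3f}")
```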
The `geneformer` subfolder contains one script:
- `zeroshot_eval_geneformer.py` extracts embeddings and evaluates the models' zero-shot performance. This script takes `./Training/geneformer/adata_var.csv` as input (see the sketch below).
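For orientation, zero-shot cell-embedding extraction with Geneformer's `EmbExtractor` looks roughly like the sketch below; the paths and label column are assumptions, and the script above adds the evaluation logic (and the `adata_var.csv` gene mapping) on top.

```python
# Hedged sketch of zero-shot cell-embedding extraction with Geneformer's
# EmbExtractor (paths and label column are placeholders; see
# zeroshot_eval_geneformer.py for the actual pipeline).
from geneformer import EmbExtractor

extractor = EmbExtractor(
    model_type="Pretrained",  # no fine-tuned classification head
    num_classes=0,
    emb_mode="cell",          # one embedding per cell
    emb_label=["cell_type"],  # assumed metadata column to carry along
    max_ncells=None,          # embed all cells
    nproc=4,
)
embs = extractor.extract_embs(
    "pretrained_geneformer_model/",     # model directory (placeholder)
    "tokenized_datasets/eval.dataset",  # tokenized input (placeholder)
    "embeddings/",                      # output directory (placeholder)
    "eval_embs",                        # output prefix (placeholder)
)
```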
If you have any questions or find any issues with the code, please open an issue in this repository. We also welcome contributions to the code - be sure to check out the Contributing section below.
If you have questions or concerns with this project and do not want to create an issue, please contact Ajay Nadig or Lorin Crawford. Any feedback on the software, manuscript, and tutorials is appreciated.
@article {ID,
author = {Nadig, Ajay and Thoutam, Akshaya and Hughes, Madeline and Gupta, Anay and Navia, Andrew W. and Fusi, Nicolo and Raghavan, Srivatsan and Winter, Peter S. and Amini, Ava P. and Crawford, Lorin},
title = {Consequences of training data composition for deep generative models in single-cell biology},
elocation-id = {2025.02.19.639127},
year = {2025},
doi = {10.1101/2025.02.19.639127},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/02/24/2025.02.19.639127},
eprint = {https://www.biorxiv.org/content/early/2025/02/24/2025.02.19.639127.full.pdf},
journal = {bioRxiv}
}
This project is available under the MIT License.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
