This repository contains the code that accompanies our paper, "Consequences of training data composition for deep generative models in single-cell biology". The preprint is available on bioRxiv at https://www.biorxiv.org/content/early/2025/02/24/2025.02.19.639127.
Foundation models for single-cell transcriptomics have the potential to augment (or replace) purpose-built tools for a variety of common analyses, especially when data are sparse. In the field of large language models, training data composition greatly shapes performance; however, to date, single-cell foundation models have largely ignored this detail, opting instead to train on the largest possible corpus. Focusing on human hematopoiesis, we trained and analyzed deep generative models with various datasets, including cells from adult and developing tissues, disease states, and perturbation atlases. We find that (1) these models generalize poorly to unseen cell types, (2) adding malignant cells to a healthy cell training corpus does not necessarily improve modeling of unseen malignant cells, and (3) including an embryonic stem cell transcription factor differentiation atlas during training improves performance on out-of-distribution tasks. Our findings emphasize the importance of diverse training data and suggest strategies to optimize future single-cell foundation models.
For LDVAE analyses, you can recreate the necessary conda environment using `scvi-env-3.txt`.
Using the submodules provided, install Geneformer:
```
git lfs install
git clone https://github.com/lcrawlab/Geneformer.git
cd Geneformer
pip install .
```
For zero-shot Geneformer evaluations, install:
```
git clone https://github.com/microsoft/zero-shot-scfoundation
cd zero-shot-scfoundation/sc_foundation_evals
pip install .
```
Scripts to reproduce our analyses are found in three folders:
- `Preprocess` contains scripts to wrangle and quality control (QC) downloaded data.
- `Train` contains scripts to train and fine-tune LDVAE and Geneformer models. Each model has its own subdirectory.
- `Evaluation` contains scripts to compute reconstruction accuracies for LDVAE models, as well as scripts for evaluating Geneformer models in a zero-shot setting. Each model has its own subdirectory.
The `Preprocess` folder contains three scripts:
- `preprocess_data_bloodbase.py` generates training datasets for the Blood Baseline set of experiments visualized in Figure 2 and Figure S2. It takes as input a random seed that is used for subsetting datasets (see the sketch after this list).
- `preprocess_data_allbase.py` generates training datasets for the Atlas Baseline set of experiments visualized in Figure S4 and Figure S6. It likewise takes a random seed as input.
- `preprocess_eval_data.py` generates evaluation datasets.
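For orientation, seed-controlled subsetting of an AnnData object can look like the minimal sketch below. This is illustrative only: the file names, fraction, and helper function are placeholders, not the scripts' actual logic.

```python
# Minimal sketch of seed-controlled subsetting (illustrative; the path,
# fraction, and helper name are placeholders, not the scripts' actual logic).
import numpy as np
import anndata as ad

def subset_adata(adata: ad.AnnData, fraction: float, seed: int) -> ad.AnnData:
    """Return a reproducible random subset of cells."""
    rng = np.random.default_rng(seed)
    n_keep = int(fraction * adata.n_obs)
    idx = rng.choice(adata.n_obs, size=n_keep, replace=False)
    return adata[np.sort(idx)].copy()

adata = ad.read_h5ad("downloaded_dataset.h5ad")  # placeholder path
train_subset = subset_adata(adata, fraction=0.1, seed=42)
train_subset.write_h5ad("training_subset_seed42.h5ad")
```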
These scripts require users to download several publicly available datasets, which are described below.
To build the Blood or Atlas (baseline) models, the relevant scTab data can be downloaded by following the scTab repository's instructions under the README header "Download and concatenate scTab". For these datasets, we downloaded 10% of the scTab resource; distinct subsets were used for training and evaluation. The same instructions apply to the BoneMarrow dataset, but because scTab contains relatively few bone marrow cells, we downloaded 100% of the resource prior to subsetting. Again, distinct subsets were used for training and evaluation.
The hematopoietic malignancy data can be downloaded from the Curated Cancer Cell Atlas (3CA). The studies we retained are listed in the Methods section of the main manuscript. The script we used to generate the file `cca_Hematologic_aggregated.h5ad` from the downloaded files is `data_wrangling_scripts/cca_wrangle.ipynb`. The Ji et al. (2020) squamous cell carcinoma (SCC) evaluation dataset can be downloaded here.
The Perturb-seq datasets, in the form of MEX files, can be downloaded using GEO accession GSE264667 for the Jurkat experiment and here for the K562 data. The script we used to generate the files `K562_essential_raw_singlecell_01_mex_collated.h5ad` and `GSE264667_jurkat_raw_singlecell_01_mex_collated.h5ad` is `data_wrangling_scripts/collate_weissman_MEX.ipynb`.
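The collation itself happens in that notebook. Schematically (the directory layout below is an assumption, and the notebook may apply additional processing), stitching a set of MEX directories into a single `.h5ad` can be done with scanpy and anndata:

```python
# Schematic MEX collation (placeholder paths; collate_weissman_MEX.ipynb is
# the authoritative version and may apply additional processing).
from pathlib import Path
import anndata as ad
import scanpy as sc

mex_dirs = sorted(Path("K562_mex/").iterdir())  # one MEX directory per sample
adatas = {d.name: sc.read_10x_mtx(d) for d in mex_dirs if d.is_dir()}
collated = ad.concat(adatas, label="sample", index_unique="-")
collated.write_h5ad("K562_essential_raw_singlecell_01_mex_collated.h5ad")
```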
For the TFAtlas dataset, the file `GSE217460_210322_TFAtlas_subsample_raw.h5ad` can be downloaded directly from GEO accession GSE217460. Preprocessing this dataset also makes use of the publicly available GTEx file `GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct`, which can be found here.
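The GTEx file is in GCT format, which places two header lines (the GCT version and the matrix dimensions) before the actual table, so it can be loaded, for example, with pandas by skipping those lines:

```python
# Reading the GTEx median-TPM GCT file; the first two lines hold the GCT
# version ("#1.2") and matrix dimensions, not data.
import pandas as pd

gtex = pd.read_csv(
    "GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct",
    sep="\t",
    skiprows=2,
)
print(gtex.columns[:3].tolist())  # gene Name, Description, then tissue columns
```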
The Human Brain Cell Atlas neuron dataset can be downloaded here. The script we used to generate the file `Neurons_H18.30.002_10Ksubset.h5ad` is `data_wrangling_scripts/explore_subset_SilettiNeuron.ipynb`.
The `Train` folder contains scripts to train LDVAE and Geneformer models.
In the `ldvae` subfolder:
- The `Train_Models.py` script trains Blood- and Atlas-baseline LDVAE models using the scvi-tools package. It takes a random seed as input and outputs trained models as well as training curves. For details on the training parameters and model architecture, please see the Methods section of the manuscript (a hedged usage sketch follows).
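For orientation, a minimal LDVAE training run with scvi-tools looks roughly like the sketch below (`LinearSCVI` is scvi-tools' linearly decoded VAE). The paths and hyperparameters are placeholders, not the values used in the paper; see `Train_Models.py` and the Methods section for those.

```python
# Hedged sketch of LDVAE training with scvi-tools (paths and hyperparameters
# are placeholders; see Train_Models.py and the Methods for the real ones).
import scvi
import anndata as ad

scvi.settings.seed = 42  # analogous to the random-seed input of Train_Models.py

adata = ad.read_h5ad("training_subset_seed42.h5ad")  # placeholder path
scvi.model.LinearSCVI.setup_anndata(adata)
model = scvi.model.LinearSCVI(adata, n_latent=10)  # placeholder latent size
model.train(max_epochs=250)
model.save("ldvae_model/", overwrite=True)

# Training curves can be recovered from the recorded history.
elbo_train = model.history["elbo_train"]
```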
In the `geneformer` subfolder:
- The `pretrain_geneformer.py` script is used to pre-train new Geneformer models. Before pre-training a Geneformer model, the test/train/validation splits of the data must be tokenized using `tokenize_data.py` (a tokenization sketch follows).
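As a rough sketch (not the exact invocation in `tokenize_data.py`; the directories and metadata column are assumptions), Geneformer's `TranscriptomeTokenizer` converts `.h5ad` files into the tokenized datasets that the pre-training script consumes:

```python
# Hedged tokenization sketch with Geneformer's TranscriptomeTokenizer
# (directories and the metadata mapping are placeholders; see
# tokenize_data.py for the actual invocation). Input .h5ad files are
# expected to carry Ensembl gene IDs in .var and total counts in .obs.
from geneformer import TranscriptomeTokenizer

tokenizer = TranscriptomeTokenizer(
    custom_attr_name_dict={"cell_type": "cell_type"},  # assumed obs column
    nproc=4,
)
tokenizer.tokenize_data(
    "train_split_h5ads/",   # directory of input .h5ad files (placeholder)
    "tokenized_datasets/",  # output directory (placeholder)
    "train",                # output prefix (placeholder)
    file_format="h5ad",
)
```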
The `Evaluation` folder contains two subfolders for evaluating LDVAE and Geneformer models.
The `ldvae` subfolder contains two scripts:
- `LDVAE_eval.py` estimates reconstruction accuracies for all model/evaluation combinations (an illustrative sketch follows this list).
- `LDVAE_eval_class.py` defines a Python class containing a method for estimating reconstruction accuracy. It also contains utilities to (1) create a sample input/reconstruction scatterplot, (2) obtain the latent representation of a dataset from a particular model, and (3) compute expression reconstruction residuals.
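The metric details live in `LDVAE_eval_class.py`. Purely as an illustration of the idea (this is not the paper's exact definition of reconstruction accuracy), one might compare observed counts against the model's reconstruction like so:

```python
# Illustrative reconstruction check for a trained LDVAE (not the paper's
# exact metric; see LDVAE_eval_class.py for the real definition).
import numpy as np
import anndata as ad
import scvi
from scipy.stats import pearsonr

adata = ad.read_h5ad("evaluation_dataset.h5ad")  # placeholder path
model = scvi.model.LinearSCVI.load("ldvae_model/", adata=adata)

# Normalized expression scaled by the model's library-size estimate gives a
# reconstruction on roughly the scale of the observed counts.
recon = model.get_normalized_expression(adata, library_size="latent")

obs = adata.X.toarray() if hasattr(adata.X, "toarray") else np.asarray(adata.X)
r, _ = pearsonr(obs.ravel(), recon.to_numpy().ravel())
print(f"Pearson r between observed and reconstructed expression: {r:.3f}")
```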
The `geneformer` subfolder contains one script:
- `zeroshot_eval_geneformer.py` extracts embeddings and evaluates the models' zero-shot performance. This script takes `./Training/geneformer/adata_var.csv` as input (see the sketch below).
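For orientation, zero-shot cell-embedding extraction with Geneformer's `EmbExtractor` looks roughly like the sketch below; the paths and label column are assumptions, and the script above adds the evaluation logic (and the `adata_var.csv` gene mapping) on top.

```python
# Hedged sketch of zero-shot cell-embedding extraction with Geneformer's
# EmbExtractor (paths and label column are placeholders; see
# zeroshot_eval_geneformer.py for the actual pipeline).
from geneformer import EmbExtractor

extractor = EmbExtractor(
    model_type="Pretrained",  # no fine-tuned classification head
    num_classes=0,
    emb_mode="cell",          # one embedding per cell
    emb_label=["cell_type"],  # assumed metadata column to carry along
    max_ncells=None,          # embed all cells
    nproc=4,
)
embs = extractor.extract_embs(
    "pretrained_geneformer_model/",     # model directory (placeholder)
    "tokenized_datasets/eval.dataset",  # tokenized input (placeholder)
    "embeddings/",                      # output directory (placeholder)
    "eval_embs",                        # output prefix (placeholder)
)
```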
If you have any questions or find any issues with the code, please open an issue in this repository. We also welcome contributions to the code - be sure to check out the Contributing section below.
If you have questions or concerns with this project and do not want to create an issue, please contact Ajay Nadig or Lorin Crawford. Any feedback on the software, manuscript, and tutorials is appreciated.
@article {ID,
author = {Nadig, Ajay and Thoutam, Akshaya and Hughes, Madeline and Gupta, Anay and Navia, Andrew W. and Fusi, Nicolo and Raghavan, Srivatsan and Winter, Peter S. and Amini, Ava P. and Crawford, Lorin},
title = {Consequences of training data composition for deep generative models in single-cell biology},
elocation-id = {2025.02.19.639127},
year = {2025},
doi = {10.1101/2025.02.19.639127},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/02/24/2025.02.19.639127},
eprint = {https://www.biorxiv.org/content/early/2025/02/24/2025.02.19.639127.full.pdf},
journal = {bioRxiv}
}
This project is available under the MIT License.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
