Abstract

Deep generative models, such as variational autoencoders (VAEs) or deep Boltzmann machines (DBMs), can generate an arbitrary number of synthetic observations after being trained on an initial set of samples. This has mainly been investigated for imaging data but could also be useful for single-cell transcriptomics (scRNA-seq). A small pilot study could then be used for planning a full-scale experiment by investigating planned analysis strategies on synthetic data of different sample sizes. It is unclear, however, whether synthetic observations generated from a small scRNA-seq dataset reflect the properties relevant for subsequent data analysis steps. We specifically investigated two deep generative modeling approaches, VAEs and DBMs. First, we considered single-cell variational inference (scVI) in two variants, generating samples either from the posterior distribution (the standard approach) or from the prior distribution. Second, we propose single-cell deep Boltzmann machines (scDBMs). When comparing clustering results on synthetic data with the ground-truth clustering, we found that the scVI posterior variant resulted in high variability, most likely because it amplifies artifacts of small datasets. All approaches showed mixed results for cell types of different abundance, overrepresenting highly abundant cell types and missing less abundant ones. With increasing pilot dataset size, the proportions of cells in each cluster became more similar to those in the ground-truth data. We also showed that all approaches learn the univariate distribution of most genes, although problems occurred with bimodality. Across all analyses, comparing 10x Genomics and Smart-seq2 technologies showed that inference from small to larger datasets is more challenging for 10x datasets, which have higher sparsity. Overall, the results indicate that deep generative approaches might be valuable for supporting the design of scRNA-seq experiments.

Highlights

  • Deep generative models, such as variational autoencoders (VAEs)[1,2] or deep Boltzmann machines (DBMs)[3], can learn the joint distribution of various types of data, and impressive results have been obtained, e.g., for generating super-resolution images in microscopy[4] and more generally for imputation[5,6]

  • To examine the quality of the data generated by single-cell variational inference (scVI) and single-cell DBMs, we used the example of designing a scRNA-seq experiment

  • By mimicking a situation where we want to plan an experiment using a pilot study with a small number of cells, we investigated the impact of varying amounts of cells and generative approaches on the clustering performance, measured by the Davies–Bouldin index (DBI)
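As an illustration of the evaluation metric used in the highlights above (this sketch is not from the paper itself), the Davies–Bouldin index compares within-cluster scatter to between-centroid separation, with lower values indicating better-separated clusters. A minimal NumPy implementation, assuming cells are rows of an expression (or embedding) matrix `X` with cluster assignments `labels`:

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: mean over clusters of the worst-case
    ratio of within-cluster scatter to between-centroid distance.
    Lower values indicate better-separated clusters."""
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # S_i: average Euclidean distance of a cluster's points to its centroid
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(clusters)
    ])
    k = len(clusters)
    ratios = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i != j:
                # M_ij: distance between centroids i and j
                m = np.linalg.norm(centroids[i] - centroids[j])
                ratios[i, j] = (scatter[i] + scatter[j]) / m
    # DBI: average over clusters of the worst-case ratio
    return ratios.max(axis=1).mean()
```

In practice one would use an established implementation such as `sklearn.metrics.davies_bouldin_score`; the sketch only makes the metric's definition explicit.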

Introduction

Deep generative models, such as variational autoencoders (VAEs)[1,2] or deep Boltzmann machines (DBMs)[3], can learn the joint distribution of various types of data, and impressive results have been obtained, e.g., for generating super-resolution images in microscopy[4] and more generally for imputation[5,6]. This raises the question of whether such techniques could be trained on data with a rather small number of samples, e.g., obtained from pilot experiments, to subsequently generate larger synthetic datasets. Such synthetic observations could inform the design of single-cell RNA sequencing (scRNA-seq) experiments by exploring planned downstream analysis steps, such as clustering, on synthetic datasets of different sizes. Researchers would specify different numbers of cells to be simulated, apply downstream analyses to the simulated data, and then evaluate the number of cells needed for detecting patterns of interest, such as clusters comprising rare cell types.
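The planning loop just described can be sketched as follows. For self-containment, a simple per-cluster Gaussian model fitted to the pilot data stands in for the deep generative model (scVI or an scDBM would be fitted at that step instead), and the names `fit_pilot_model` and `sample_cells` are illustrative, not from any library:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_pilot_model(X, labels):
    """Stand-in generative model: per-cluster mean/std and mixing
    proportion estimated from the pilot data (a deep generative
    model such as scVI or an scDBM would be fitted here instead)."""
    params = []
    for c in np.unique(labels):
        Xc = X[labels == c]
        params.append((Xc.mean(axis=0), Xc.std(axis=0) + 1e-6, len(Xc) / len(X)))
    return params

def sample_cells(params, n):
    """Draw n synthetic cells, preserving the pilot cluster proportions."""
    means, stds, props = zip(*params)
    comp = rng.choice(len(params), size=n, p=np.array(props) / sum(props))
    cells = np.array([rng.normal(means[k], stds[k]) for k in comp])
    return cells, comp

# Pilot data: two unbalanced cell types in a 2-dimensional toy "expression" space
pilot = np.vstack([rng.normal([0, 0], 1, (40, 2)),
                   rng.normal([8, 8], 1, (10, 2))])
pilot_labels = np.array([0] * 40 + [1] * 10)
model = fit_pilot_model(pilot, pilot_labels)

# Simulate datasets of increasing size and apply the planned downstream
# analysis; here we simply check how well the rare cell type is represented.
for n in (100, 1000, 10000):
    synth, comp = sample_cells(model, n)
    rare_frac = (comp == 1).mean()
    print(n, round(rare_frac, 3))
```

In a real application, the downstream step inside the loop would be the planned clustering pipeline, and its output (e.g., the Davies–Bouldin index or the number of recovered clusters) would be compared across simulated dataset sizes.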
