Abstract

Motivation: Single-cell RNA-seq makes possible the investigation of variability in gene expression among cells, and dependence of variation on cell type. Statistical inference methods for such analyses must be scalable, and ideally interpretable.

Results: We present an approach based on a modification of a recently published highly scalable variational autoencoder framework that provides interpretability without sacrificing much accuracy. We demonstrate that our approach enables identification of gene programs in massive datasets. Our strategy, namely the learning of factor models with the auto-encoding variational Bayes framework, is not domain specific and may be useful for other applications.

Availability and implementation: The factor model is available in the scVI package hosted at https://github.com/YosefLab/scVI/.

Contact: v@nxn.se

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • The study of the regulatory architecture of cells has revealed numerous examples of co-regulation of transcription of large numbers of genes (Jang et al., 2017; Kondo et al., 2018), and this has been used to link the organization of cells to their distinct functions in response to developmental or external stimuli (Romero et al., 2012).

  • We show that using a flexible non-linear inference model along with a linear reconstruction function makes it possible to benefit from the efficiency of variational autoencoders (VAEs), while retaining the interpretability provided by factor models.

  • A comparison of the VAE with the linearly decoded variational autoencoder (LDVAE) showed that the VAE has a smaller reconstruction error than the LDVAE (Fig. 1b).


Summary

Introduction

The study of the regulatory architecture of cells has revealed numerous examples of co-regulation of transcription of large numbers of genes (Jang et al., 2017; Kondo et al., 2018), and this has been used to link the organization of cells to their distinct functions in response to developmental or external stimuli (Romero et al., 2012).

Principal component analysis (PCA) models data as arising from a continuous multivariate Gaussian distribution, and optimizes a Gaussian likelihood (Pearson, 1901; Tipping and Bishop, 1999). This model assumption is at odds with the count data measured in single-cell RNA-seq (Svensson, 2020; William Townes et al., 2019), and leads to interpretation problems (Hicks et al., 2018).

Inference using VAEs scales to arbitrarily large data, since minibatches of data can be used to train the parameters of both the inference model and the decoder function (Kingma and Welling, 2013). Despite these efficiency advantages, the representations inferred with VAEs are not directly interpretable.

By adapting the method of scVI (Lopez et al., 2018), we demonstrate a scalable approach to learning a latent representation of single-cell RNA-seq data that identifies the relationship between cell representation coordinates and gene weights via a factor model. Because it is linear, our reconstruction function provides an interpretable link between gene programs and cellular molecular phenotypes (Fig. 1a).
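
To make the idea of a linear reconstruction function concrete, the following is a minimal sketch in plain Python, not the scVI implementation. It assumes that the decoder multiplies a cell's latent coordinates by a gene-by-factor loading matrix, normalizes the result with a softmax, and scales it by the cell's library size; the softmax step, the function names (`linear_decode`, `top_genes`), and the toy gene names and loadings are all illustrative assumptions. Under that assumption, each column of the loading matrix can be read as a gene program.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear_decode(z, loadings, library_size):
    """Map latent coordinates z to expected counts per gene.

    loadings[g][k] is the weight of latent factor k on gene g.
    Assumption: expected counts = library_size * softmax(loadings @ z).
    """
    logits = [sum(w * z_k for w, z_k in zip(row, z)) for row in loadings]
    proportions = softmax(logits)
    return [library_size * p for p in proportions]


def top_genes(loadings, gene_names, factor, n=2):
    """Return the n genes with the largest loading on one latent factor."""
    ranked = sorted(
        zip(gene_names, loadings),
        key=lambda pair: pair[1][factor],
        reverse=True,
    )
    return [name for name, _ in ranked[:n]]


# Hypothetical toy data: 4 genes, 2 latent factors.
gene_names = ["geneA", "geneB", "geneC", "geneD"]
loadings = [
    [2.0, 0.0],
    [1.5, -0.5],
    [0.0, 1.0],
    [-1.0, 2.5],
]

# A cell that sits along the first latent axis only.
z = [1.0, 0.0]
expected = linear_decode(z, loadings, library_size=1000.0)

# Because the decoder is linear, the loadings can be inspected directly.
program_0 = top_genes(loadings, gene_names, factor=0)
program_1 = top_genes(loadings, gene_names, factor=1)
```

In this sketch, moving a cell along one latent axis changes the pre-softmax score of each gene in proportion to that gene's loading on the axis, which is what lets the highest-weighted genes be read off as a program. A non-linear decoder offers no such direct reading.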

