Abstract

Single-cell ATAC-seq (scATAC-seq) profiles the chromatin accessibility landscape at the single-cell level, thus revealing cell-to-cell variability in gene regulation. However, the high dimensionality and sparsity of scATAC-seq data often complicate the analysis. Here, we introduce a method for analyzing scATAC-seq data, called Single-Cell ATAC-seq analysis via Latent feature Extraction (SCALE). SCALE combines a deep generative framework and a probabilistic Gaussian Mixture Model to learn latent features that accurately characterize scATAC-seq data. We validate SCALE on datasets generated on different platforms with different protocols and with different overall data quality. SCALE substantially outperforms the other tools in all aspects of scATAC-seq data analysis, including visualization, clustering, and denoising and imputation. Importantly, SCALE also generates interpretable features that directly link to cell populations, and can potentially reveal batch effects in scATAC-seq experiments.

Highlights

  • Single-cell ATAC-seq profiles the chromatin accessibility landscape at the single-cell level, revealing cell-to-cell variability in gene regulation

  • SCALE models the input scATAC-seq data x as a joint distribution p(x, z, c), where c is one of K predefined clusters, each corresponding to a component of a Gaussian Mixture Model (GMM), and z is the latent variable obtained by z = μz + σz ε, where μz and σz are learned by the encoder network from x, and ε is sampled from N(0, I) [16]

  • Given the K predefined clusters, p(z|c) follows a mixture-of-Gaussians distribution with a mean μc and a variance σc for each component corresponding to a cluster c, and p(x|z) is a multivariate Bernoulli distribution modeled by the decoder network (Fig. 1)
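
The pieces named in the highlights above can be illustrated with a minimal NumPy sketch. This is not the SCALE implementation: the function names, the array shapes, the mixture weights `pi`, and the diagonal-covariance form of the GMM are assumptions made for illustration, and the encoder and decoder networks are left out entirely. It shows the reparameterized draw z = μz + σz ε, the cluster responsibilities p(c|z) under a GMM, and a Bernoulli log-likelihood for binary accessibility data.

```python
import numpy as np


def reparameterize(mu_z, sigma_z, rng):
    """Draw z = mu_z + sigma_z * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu_z.shape)
    return mu_z + sigma_z * eps


def cluster_posterior(z, pi, mu_c, var_c):
    """Responsibilities p(c|z) under a GMM with diagonal covariance (assumed).

    z: (n_cells, d), pi: (K,), mu_c and var_c: (K, d); var_c holds variances.
    """
    diff = z[:, None, :] - mu_c[None, :, :]            # (n_cells, K, d)
    log_pdf = -0.5 * np.sum(
        np.log(2.0 * np.pi * var_c)[None, :, :] + diff**2 / var_c[None, :, :],
        axis=2,
    )                                                  # log N(z | mu_c, var_c)
    log_joint = np.log(pi)[None, :] + log_pdf          # log p(c) + log p(z|c)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # stabilize the exponent
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)


def bernoulli_log_likelihood(x, p):
    """log p(x|z) for binary x given per-peak probabilities p (decoder output)."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)                   # avoid log(0)
    return np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p), axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical encoder outputs for 4 cells in a 2-dimensional latent space.
    mu_z = np.zeros((4, 2))
    sigma_z = 0.1 * np.ones((4, 2))
    z = reparameterize(mu_z, sigma_z, rng)
    # Hypothetical GMM with K = 2 clusters.
    pi = np.array([0.5, 0.5])
    mu_c = np.array([[0.0, 0.0], [5.0, 5.0]])
    var_c = np.ones((2, 2))
    print(cluster_posterior(z, pi, mu_c, var_c).argmax(axis=1))
```

In a full model the decoder would map z to the probabilities `p`, and the GMM parameters would be learned jointly with the networks; here they are fixed by hand only to make the sketch runnable.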


Summary

Results

The imputation of SCALE could strengthen the distinct patterns of cluster-specific peaks by filling missing values and removing potential noise (Supplementary Fig. 10), which improves downstream analysis, for example the identification of cell-type-specific motifs and transcription factors by chromVAR. We demonstrated this feature with the Forebrain dataset.

We constructed a simulated dataset by first generating reference scATAC-seq data consisting of three clusters, each containing 100 peaks with no missing values, and then randomly dropping out peaks and introducing noise (Methods, Supplementary Fig. 14a).

In the embedding and clustering results based on the SCALE-extracted features, the cells of each replicate were distributed evenly in the low-dimensional space (Supplementary Fig. 17c). We confirmed this result by checking the top specific peaks for each replicate based on raw data and found no significantly different pattern across replicates.

We could improve the model to explicitly incorporate variables that are designated for the discovery and removal of batch effects and other possible technical variations.
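
The dataset construction described above (a clean reference of three clusters with 100 peaks each, followed by random dropout and added noise) can be sketched as follows. This is only an illustration under stated assumptions: the function name `simulate_scatac`, the number of cells per cluster, the dropout and noise rates, and the reading of "100 peaks" as 100 cluster-specific peaks are all hypothetical, and the actual procedure is the one given in the paper's Methods.

```python
import numpy as np


def simulate_scatac(cells_per_cluster=50, peaks_per_cluster=100, n_clusters=3,
                    dropout_rate=0.5, noise_rate=0.05, seed=0):
    """Return (reference, observed, labels) binary cell-by-peak matrices.

    All default values are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    n_cells = cells_per_cluster * n_clusters
    n_peaks = peaks_per_cluster * n_clusters
    labels = np.repeat(np.arange(n_clusters), cells_per_cluster)

    # Reference: each cluster has its own block of fully accessible peaks.
    reference = np.zeros((n_cells, n_peaks), dtype=np.int8)
    for c in range(n_clusters):
        start, stop = c * peaks_per_cluster, (c + 1) * peaks_per_cluster
        reference[labels == c, start:stop] = 1

    # Dropout: each accessible entry is kept with probability 1 - dropout_rate.
    keep = rng.random(reference.shape) >= dropout_rate
    observed = reference * keep

    # Noise: closed entries are flipped to open with probability noise_rate.
    noise = rng.random(reference.shape) < noise_rate
    observed = np.where((reference == 0) & noise, 1, observed)
    return reference, observed.astype(np.int8), labels


if __name__ == "__main__":
    reference, observed, labels = simulate_scatac()
    print(reference.shape, int(reference.sum()), int(observed.sum()))
```

Keeping the clean `reference` matrix alongside `observed` is what makes such a simulation useful for evaluating imputation: recovered values can be compared directly against entries that are known to be accessible.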

Methods
Code availability
