Abstract

Single-cell RNA sequencing (scRNA-seq) technologies have precipitated the development of bioinformatic tools to reconstruct cell lineage specification and differentiation processes with single-cell precision. However, current start-up costs and recommended data volumes for statistical analysis remain prohibitively expensive, preventing scRNA-seq technologies from becoming mainstream. Here, we introduce single-cell amalgamation by latent semantic analysis (SALSA), a versatile workflow that combines measurement reliability metrics with latent variable extraction to infer robust expression profiles from ultra-sparse scRNA-seq data. SALSA uses a matrix focusing approach that starts by identifying facultative genes with expression levels greater than experimental measurement precision and ends with cell clustering based on a minimal set of Profiler genes, each one a putative biomarker of cluster-specific expression profiles. To benchmark how SALSA performs in experimental settings, we used the publicly available 10X Genomics PBMC 3K dataset, a pre-curated silver standard from human frozen peripheral blood comprising 2,700 single-cell barcodes, and identified 7 major cell groups matching transcriptional profiles of peripheral blood cell types and driven agnostically by < 500 Profiler genes. Finally, we demonstrate successful implementation of SALSA in a replicative scRNA-seq scenario by using previously published DropSeq data from a multi-batch mouse retina experimental design, thereby identifying 10 transcriptionally distinct cell types from > 64,000 single cells across 7 independent biological replicates based on < 630 Profiler genes. With these results, SALSA demonstrates that robust pattern detection from scRNA-seq expression matrices only requires a fraction of the accrued data, suggesting that single-cell sequencing technologies can become affordable and widespread if meant as hypothesis-generation tools to extract large-scale differential expression effects.

Highlights

  • Next-generation sequencing technologies are transforming how biologists characterize the molecular features of organogenesis and the composition of heterogeneous tissues; among them, RNA sequencing (RNA-seq) is one of the most widely adopted modalities (Mortazavi et al, 2008; Oshlack et al, 2010; Roy et al, 2011)

  • PBMC 3K Exhibits Near-Unary Architecture. To evaluate single-cell amalgamation by latent semantic analysis (SALSA), we analyzed a publicly available “silver” standard dataset that is widely regarded for its single-cell coverage richness: the frozen Peripheral Blood Mononuclear Cells dataset with 2,700 barcodes available through 10X Genomics

  • SALSA provides the means to refine the process of identifying candidate biomarkers from replicative assays even further: it can take independently sequenced scRNA-seq libraries, determine subsets of replicated genes ranking at different levels of prospective reproducibility for each (from facultative to profiler genes), and prioritize which commonly detected genes to include in an all-at-once scRNA-seq analysis


Summary

INTRODUCTION

Next-generation sequencing technologies are transforming how biologists characterize the molecular features of organogenesis and the composition of heterogeneous tissues; among them, RNA sequencing (RNA-seq) is one of the most widely adopted modalities (Mortazavi et al, 2008; Oshlack et al, 2010; Roy et al, 2011). We inferred that a mixture model combining two or more extreme value distributions, each predominant at a different scale of UMI tallies, could serve as an empirical parametric descriptor of total UMI counts per cell (or per gene) for the scRNA-seq dataset as a whole. With this in mind, we defined a general two-component mixture distribution, the PC-PD mixture model (Supplementary Figure S1A), that bridges two extreme scenarios to expect from different scRNA-seq techniques: (a) a finite number of barcodes is available, and all detected artifact and single-cell barcodes share a similar baseline level of UMI counts derived from nucleic acid debris throughout the biological specimen (“noise lifts barcodes,” akin to combinatorial-based scRNA-seq techniques, Fréchet distribution); and (b) there are substantially more artifact barcodes with low total UMI counts than single-cell barcodes with higher total UMI counts (“noise gets barcodes,” akin to droplet-based scRNA-seq techniques, Weibull distribution). Hereafter, we convey gene stratification results using a shorthand graphical aid, the “frosty” plot, which illustrates the transition in data retention across filters of rising statistical stringency (Figure 3B).
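To make the idea of a two-component mixture of extreme value distributions concrete, the following is a minimal sketch of a generic Fréchet-plus-Weibull mixture density over total UMI counts. It is not the authors' PC-PD parameterization, which this summary does not specify: the function names, the mixing weight, and all parameter values below are hypothetical and chosen only for illustration.

```python
import math


def _power_term(x, scale, exponent):
    """Return (x / scale) ** exponent, capped to avoid float overflow."""
    u = exponent * math.log(x / scale)
    return math.exp(min(u, 700.0))


def frechet_pdf(x, shape, scale):
    """Fréchet density: (shape/x) * t * exp(-t), with t = (x/scale)^(-shape)."""
    if x <= 0:
        return 0.0
    t = _power_term(x, scale, -shape)
    return (shape / x) * (t * math.exp(-t))


def frechet_cdf(x, shape, scale):
    """Fréchet distribution function: exp(-(x/scale)^(-shape))."""
    if x <= 0:
        return 0.0
    return math.exp(-_power_term(x, scale, -shape))


def weibull_pdf(x, shape, scale):
    """Weibull density: (shape/x) * t * exp(-t), with t = (x/scale)^shape."""
    if x <= 0:
        return 0.0
    t = _power_term(x, scale, shape)
    return (shape / x) * (t * math.exp(-t))


def weibull_cdf(x, shape, scale):
    """Weibull distribution function: 1 - exp(-(x/scale)^shape)."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-_power_term(x, scale, shape))


def mixture_pdf(x, weight, frechet, weibull):
    """Density of weight * Fréchet + (1 - weight) * Weibull.

    `frechet` and `weibull` are (shape, scale) tuples; `weight` lies in [0, 1].
    """
    return (weight * frechet_pdf(x, *frechet)
            + (1.0 - weight) * weibull_pdf(x, *weibull))


def mixture_cdf(x, weight, frechet, weibull):
    """Distribution function of the same two-component mixture."""
    return (weight * frechet_cdf(x, *frechet)
            + (1.0 - weight) * weibull_cdf(x, *weibull))


if __name__ == "__main__":
    # Hypothetical parameters, not fitted to any dataset.
    weight, frechet, weibull = 0.3, (2.0, 100.0), (0.8, 500.0)
    for umi_total in (10, 100, 1000, 10000):
        density = mixture_pdf(umi_total, weight, frechet, weibull)
        cumulative = mixture_cdf(umi_total, weight, frechet, weibull)
        print(umi_total, density, cumulative)
```

In this sketch the mixing weight controls which component dominates: a weight near 1 approximates the Fréchet-like scenario (a), and a weight near 0 approximates the Weibull-like scenario (b). Fitting the parameters to observed UMI totals, for example by maximum likelihood, is left out here.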

