Abstract
The recently developed droplet-based single-cell transcriptome sequencing (scRNA-seq) technology makes it feasible to perform a population-scale scRNA-seq study, in which the transcriptome is measured for tens of thousands of single cells from multiple individuals. Despite the advances of many clustering methods, there are few tailored methods for population-scale scRNA-seq studies. Here, we develop a Bayesian mixture model for single-cell sequencing (BAMM-SC) method to cluster scRNA-seq data from multiple individuals simultaneously. BAMM-SC takes raw count data as input and accounts for data heterogeneity and batch effects among multiple individuals in a unified Bayesian hierarchical model framework. Results from extensive simulation studies and applications of BAMM-SC to in-house experimental scRNA-seq datasets using blood, lung and skin cells from humans or mice demonstrate that BAMM-SC outperformed existing clustering methods with considerably improved clustering accuracy, particularly in the presence of heterogeneity among individuals.
Highlights
The recently developed droplet-based single-cell transcriptome sequencing technology makes it feasible to perform a population-scale scRNA-seq study, in which the transcriptome is measured for tens of thousands of single cells from multiple individuals
Compared with other clustering methods that ignore individual-level variability, the Bayesian mixture model for single-cell sequencing (BAMM-SC) has the following four key advantages: (1) BAMM-SC accounts for data heterogeneity among multiple individuals, such as unbalanced sequencing depths and technical biases in library preparation, and reduces the false positives of detecting individual-specific cell types
(3) BAMM-SC performs one-step clustering on raw unique molecular identifier (UMI) count matrix without any prior batch-correction step, which is required for most clustering methods in the presence of batch effect
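As a rough illustration of what clustering directly on raw UMI counts can look like, the sketch below scores each cell's count vector under a Dirichlet-multinomial likelihood and assigns the cell to the best-scoring cell type. This is not the BAMM-SC model itself: the text above describes BAMM-SC only as a Bayesian hierarchical mixture model on raw counts, so the choice of a Dirichlet-multinomial likelihood, the function names, and the parameter values are all assumptions made for illustration, and the individual-level hierarchy is omitted entirely.

```python
from math import lgamma


def dm_log_likelihood(counts, alpha):
    """Log Dirichlet-multinomial likelihood of one cell's raw UMI counts.

    The multinomial coefficient is dropped because it does not depend on
    alpha and therefore cancels when cell types are compared.
    """
    total_counts = sum(counts)
    total_alpha = sum(alpha)
    ll = lgamma(total_alpha) - lgamma(total_alpha + total_counts)
    for x, a in zip(counts, alpha):
        ll += lgamma(a + x) - lgamma(a)
    return ll


def assign_cluster(counts, cluster_alphas):
    """Return the index of the cell type with the highest log-likelihood."""
    scores = [dm_log_likelihood(counts, alpha) for alpha in cluster_alphas]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical parameters for two cell types over three genes (assumed values).
cluster_alphas = [
    [8.0, 1.0, 1.0],  # type 0: gene 0 dominates expression
    [1.0, 1.0, 8.0],  # type 1: gene 2 dominates expression
]

# Two cells with very different sequencing depths, left as raw counts.
shallow_cell = [9, 1, 0]    # 10 UMIs in total
deep_cell = [5, 10, 85]     # 100 UMIs in total

labels = [assign_cluster(c, cluster_alphas) for c in (shallow_cell, deep_cell)]
```

Because the likelihood is defined on the counts themselves, the two cells are scored without any normalization or log transformation, even though their sequencing depths differ tenfold.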
Summary
The recently developed droplet-based single-cell transcriptome sequencing (scRNA-seq) technology makes it feasible to perform a population-scale scRNA-seq study, in which the transcriptome is measured for tens of thousands of single cells from multiple individuals. Unsupervised methods tailored to scRNA-seq, such as SIMLR[9], CellTree[10], SC3[11], TSCAN[12], and DIMM-SC[13], have been proposed for clustering scRNA-seq data. Supervised methods, such as MetaNeighbor, have been proposed to assess how well cell-type-specific transcriptional profiles replicate across different datasets[14]. Two new methods, mutual nearest neighbors[16] (MNN) (implemented in scran) and canonical correlation analysis (CCA)[17] (implemented in Seurat), were published for batch correction of scRNA-seq data. All these methods require the raw counts to be transformed to continuous values under different assumptions, which may alter the data structure in some cell types and complicate biological interpretation. We first conducted an exploratory data analysis to demonstrate the existence of batch effects across multiple individuals using both publicly available and three in-house synthetic droplet-based scRNA-seq datasets, including human peripheral blood mononuclear cells (PBMC), mouse lung and human skin tissues. We produced a t-SNE plot based on the first 50 principal components (Supplementary Fig. 1) of all cells from