Abstract

Clustering of joint single-cell RNA-Seq (scRNA-Seq) data is often challenged by confounding factors, such as batch effects and biologically relevant variability. Existing batch effect removal methods typically rely on the strong assumption that the composition of cell populations is nearly identical across samples. Here, we present CIDER, a meta-clustering workflow based on inter-group similarity measures. We demonstrate that CIDER outperforms other scRNA-Seq clustering methods and integration approaches in both simulated and real datasets. Moreover, we show that CIDER can be used to assess the biological correctness of integration in real datasets without requiring prior cellular annotations.

Highlights

  • Design of Clustering by IDER (CIDER) and proof-of-concept experiment: The core of CIDER is the Inter-group Differential ExpRession (IDER) metric, which can be used to compute the similarity between two groups of cells across datasets (Fig. 1A).

  • Differential expression in IDER is computed using the same principle as limma-trend [17], which was chosen from a collection of approaches for differential expression analysis based on a number of performance criteria (Additional file 1: Fig. S1A, B) [18].

Introduction

The widespread adoption of single-cell RNA sequencing (scRNA-Seq) as a modality for investigating functional cellular heterogeneity means that multiple datasets are now routinely generated from the same types of tissues and organs across a number of individuals. Integration of multiple scRNA-Seq datasets can provide more comprehensive interpretations by borrowing information across experiments and even species [1]. However, the data from multiple experiments are often confounded by inter-batch or inter-donor variability. Existing clustering workflows can effectively identify cell populations in batch-effect-free datasets [2] by partitioning cells based on the inter-cell distance matrix computed from the expression data of high-variance genes (HVGs) or the derived principal components. SC3 constructs the distance matrix by applying Euclidean, Pearson, and Spearman metrics to the expression data of HVGs and transforms this distance matrix by principal component analysis (PCA) or graph Laplacian transformation before consensus clustering [3]. RaceID computes the distance matrix in the same way as SC3 but provides more options for distance measures, including Kendall and proportionality [4].
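To make the distance-based step described above concrete, the following is a minimal Python sketch of how Euclidean, Pearson, and Spearman inter-cell distance matrices might be computed from an HVG expression matrix and then reduced by PCA. It is an illustration only and is not the SC3 or RaceID implementation: the function names `cell_distance_matrices` and `pca_transform`, the use of one minus the correlation as a distance, and the column-centring before the decomposition are all assumptions made for this example. The consensus clustering step is omitted.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import rankdata


def cell_distance_matrices(expr):
    """Return three cell-by-cell distance matrices.

    expr: array of shape (n_cells, n_genes) holding (log-normalised)
    expression of the selected HVGs. Hypothetical helper, not SC3 code.
    """
    # Euclidean distance between every pair of cells.
    euclidean = squareform(pdist(expr, metric="euclidean"))
    # Pearson distance, assumed here to be 1 - correlation between cells.
    pearson = 1.0 - np.corrcoef(expr)
    # Spearman correlation is the Pearson correlation of per-cell gene ranks.
    ranks = rankdata(expr, axis=1)
    spearman = 1.0 - np.corrcoef(ranks)
    return {"euclidean": euclidean, "pearson": pearson, "spearman": spearman}


def pca_transform(dist, n_components):
    """Project a distance matrix onto its leading principal components."""
    # Centre each column, then use the SVD to obtain the PCA scores.
    centered = dist - dist.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_expr = rng.normal(size=(6, 20))  # 6 cells, 20 HVGs (synthetic data)
    matrices = cell_distance_matrices(toy_expr)
    embedding = pca_transform(matrices["spearman"], n_components=2)
    print(embedding.shape)
```

In a workflow of this kind, each transformed matrix would then be clustered separately and the resulting partitions combined, which is the role the consensus step plays in the description above.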
