Abstract

Clustering is a prevalent means of analyzing single cell RNA sequencing (scRNA-seq) data, but the rapidly expanding data volume can make this process computationally challenging. New methods for both accurate and efficient clustering are urgently needed. Here we propose Spearman subsampling-clustering-classification (SSCC), a new clustering framework based on random projection and feature construction, for large-scale scRNA-seq data. SSCC greatly improves clustering accuracy, robustness, and computational efficiency for various state-of-the-art algorithms benchmarked on multiple real datasets. On a dataset with 68,578 human blood cells, SSCC achieved a 20% improvement in clustering accuracy and a 50-fold acceleration while consuming only 66% of the memory, compared to the widely used software package SC3. Compared to k-means, the accuracy improvement of SSCC can reach 3-fold. An R implementation of SSCC is available at https://github.com/Japrin/sscClust.
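
The subsampling-clustering-classification idea named in the abstract can be illustrated with a minimal Python sketch. This is not the sscClust implementation. It assumes that each cell is described by its Spearman correlations to a set of subsampled cells, that the subsample is clustered with a plain k-means (farthest-point initialisation is used here only to keep the toy example stable), and that a nearest-centroid rule stands in for the classification step. The function names (`sscc_sketch`, `kmeans`, `nearest`) and the toy expression vectors are hypothetical.

```python
import random


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            r[order[t]] = avg
        i = j + 1
    return r


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, n_iter=20):
    """Plain k-means with farthest-point initialisation (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def sscc_sketch(cells, sub_idx, k):
    """Subsample, cluster the subsample, then classify every cell."""
    sub = [cells[i] for i in sub_idx]

    def features(cell):
        # Feature construction: Spearman correlation to each subsampled cell.
        return [spearman(cell, s) for s in sub]

    centroids = kmeans([features(c) for c in sub], k)
    # Classification stand-in: assign each cell to its nearest centroid.
    return [nearest(features(c), centroids) for c in cells]


# Toy data (hypothetical): four "increasing" and four "decreasing" profiles.
cells = [
    [1, 2, 3, 4, 5, 6], [1, 3, 2, 4, 5, 6], [2, 3, 4, 5, 6, 7], [1, 2, 3, 5, 4, 6],
    [6, 5, 4, 3, 2, 1], [6, 4, 5, 3, 2, 1], [7, 6, 5, 4, 3, 2], [6, 5, 4, 2, 3, 1],
]
sub_idx = random.Random(0).sample(range(len(cells)), 4)  # random subsample
print(sscc_sketch(cells, sub_idx, k=2))
```

Only the subsampled cells enter the clustering step, so in this sketch the clustering cost depends on the subsample size rather than on the total number of cells. The remaining cells only need one feature computation and one nearest-centroid lookup each.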

Highlights

  • Single cell RNA sequencing has revolutionized transcriptomic studies by revealing the heterogeneity of individual cells with high resolution [1,2,3,4,5,6]

  • By evaluating five clustering algorithms (k-means, k-medoids, affinity propagation, SC3 and SIMLR), we observed that SSCC can significantly improve the clustering accuracy of all five algorithms [...] on the Zheng dataset, which depended on the specific selection of algorithms and [...]

  • Pearson correlations of silhouette values between the two clustering schemes increased from 0.47 to 0.58 when switching from SCC to SSCC (Figure 6b). All these metrics suggest that SSCC can greatly improve clustering efficiency and accuracy for large-scale scRNA-seq datasets, and can greatly improve the consistency [...]

Introduction

Single cell RNA sequencing (scRNA-seq) has revolutionized transcriptomic studies by revealing the heterogeneity of individual cells with high resolution [1,2,3,4,5,6]. Multiple clustering algorithms have been developed, including Seurat [11], SC3 [12], SIMLR [13], ZIFA [14], CIDR [15], and SNN-Cliq [16]. [...] greatly but often have high computational complexity, impeding the extension of these [...]
