Abstract

Clustering is an efficient way to analyze single-cell RNA sequencing data. It is commonly used to identify cell types, which helps in understanding cell differentiation processes. However, different single-cell clustering methods can produce different, sometimes conflicting, results, which makes it hard for biologists to choose a reliable clustering and interpret its biological significance. A cluster ensemble strategy can be an effective solution to this problem. Because graph partitioning-based clustering methods perform well on single-cell data, we developed Sc-GPE, a novel cluster ensemble method that combines five single-cell graph partitioning-based clustering methods: SNN-cliq, PhenoGraph, SC3, SSNN-Louvain, and MPGS-Louvain. In Sc-GPE, a consensus matrix is constructed from the five clustering solutions by calculating the probability that each pair of cells is assigned to the same cluster. This avoids a problem of the hypergraph-based ensemble approach: each individual clustering method assigns its own cluster labels, and it is difficult to find the corresponding cluster labels across all methods. Then, to account for the different importance of each method in the ensemble, a weighted consensus matrix is constructed using an importance score strategy. Finally, hierarchical clustering is performed on the weighted consensus matrix to cluster the cells. To evaluate performance, we compared Sc-GPE with the individual clustering methods and the state-of-the-art SAME-clustering on 12 single-cell RNA-seq datasets. The results show that Sc-GPE obtained the best average performance and achieved the highest NMI and ARI values on five datasets.
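The pipeline described in the abstract (consensus matrix, per-method weighting, hierarchical clustering) can be illustrated with a minimal sketch. This is not the Sc-GPE implementation: the function names `weighted_consensus` and `average_linkage`, the toy labelings, and the weight values are all assumptions made for illustration, and the abstract does not specify how the importance scores are computed or which linkage is used, so the weights are simply passed in and average linkage is chosen arbitrarily.

```python
from itertools import combinations


def weighted_consensus(labelings, weights=None):
    """Return an n x n matrix whose (i, j) entry is the weighted share of
    methods that place cells i and j in the same cluster."""
    n = len(labelings[0])
    if weights is None:
        weights = [1.0] * len(labelings)  # unweighted: plain co-clustering frequency
    total = sum(weights)
    matrix = [[0.0] * n for _ in range(n)]
    for labels, w in zip(labelings, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += w / total
    return matrix


def average_linkage(consensus, k):
    """Agglomerative clustering on distance = 1 - consensus, stopping at k clusters."""
    clusters = [[i] for i in range(len(consensus))]

    def dist(a, b):
        pairs = [(i, j) for i in a for j in b]
        return sum(1.0 - consensus[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > k:
        # Merge the two clusters with the smallest average pairwise distance.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(consensus)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels


# Three hypothetical base clusterings of six cells; label names differ by method.
labelings = [
    [0, 0, 0, 1, 1, 1],
    ["b", "b", "a", "a", "a", "a"],
    [2, 2, 2, 5, 5, 5],
]
weights = [0.4, 0.2, 0.4]  # made-up importance scores, not from the paper
consensus = weighted_consensus(labelings, weights)
final_labels = average_linkage(consensus, k=2)
print(final_labels)
```

In this toy input the second method disagrees with the other two about the third cell, and because it carries the smallest weight the ensemble follows the majority partition.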

Highlights

  • Single-cell RNA sequencing data measure gene expression in individual cells, rather than the average expression level across cells that bulk RNA-seq reports (Stuart and Satija, 2019)

  • SAFE-clustering (Yang et al, 2019) implemented a hypergraph-based strategy to ensemble CIDR, Seurat, tSNE, and SC3 into a consensus matrix, and used k-means to cluster cells. SAME-clustering (Huh et al, 2020) was later proposed; it uses a consensus matrix-based strategy to ensemble the same four clustering methods and applies the Expectation-Maximization algorithm to cluster cells. We find that these cluster ensemble methods rely on hypergraph-based or voting-based ensemble learning and do not consider the different importance of the individual clustering methods

  • Sc-GPE has the following three advantages: (1) it does not need to reconcile the different cluster labels produced by different clustering methods, so it is suitable for unsupervised clustering where the true cluster labels are unavailable; (2) it is easy to implement, since no special parameters need to be adjusted; (3) the weighting strategy is comprehensible and effective
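Advantage (1) rests on the fact that a pairwise co-membership matrix depends only on which cells are grouped together, not on what the groups are called. The short sketch below illustrates this with two made-up labelings; the helper name `co_membership` is an assumption for illustration and is not taken from the paper.

```python
def co_membership(labels):
    """Binary matrix: entry (i, j) is 1 when cells i and j share a cluster."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]


# The same partition of five cells, named differently by two hypothetical methods.
method_a = [0, 0, 1, 1, 2]
method_b = ["z", "z", "x", "x", "y"]

same_partition = co_membership(method_a) == co_membership(method_b)
print(same_partition)  # True: no label matching across methods is needed
```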

Summary

Introduction

Single-cell RNA sequencing (scRNA-seq) measures gene expression in individual cells, rather than the average expression level across cells that bulk RNA-seq reports (Stuart and Satija, 2019). It has advantages in accurately identifying the transcriptomic signatures of cell types (Grün et al, 2015). With the rapid development of scRNA-seq technologies, the cost of sequencing has fallen and larger datasets are being generated, but these carry a higher error rate (Vitak et al, 2017). The drop-out rate arising from reverse transcription failure and sequencing depth can reach 80% (Soneson and Robinson, 2018; Andrews and Hemberg, 2019); (2) high dimension.
