Single-cell RNA sequencing is a transformative technology that enables the study of tissue heterogeneity at the cellular level. Clustering is the key computational approach for grouping cells by their transcriptome profiles in single-cell RNA-seq data. However, accurate identification of distinct cell types is hampered by high dimensionality, and applying clustering directly to the original transcriptome can produce uninformative clusters. To address this challenge, this study proposes an evolutionary multiobjective deep clustering (EMDC) algorithm for identifying cell types in single-cell RNA-seq data. First, EMDC removes redundant and irrelevant genes by applying differential gene expression analysis to identify genes that are differentially expressed across biological conditions. Next, a deep autoencoder projects the high-dimensional data into several low-dimensional nonlinear embedding subspaces, one for each bottleneck layer size. A basic clustering algorithm is then applied in each of these embedding subspaces, and the resulting basic clusterings form the cluster ensemble. To reduce the unnecessary cost incurred by the clusterings in the ensemble, a multiobjective evolutionary optimization is designed to prune the basic clustering results, improving cell type discovery performance under three objective functions. Experiments on 30 synthetic single-cell RNA-seq datasets and six real single-cell RNA-seq datasets show that EMDC outperforms eight other clustering methods and three multiobjective optimization algorithms in cell type identification. In addition, extensive comparisons demonstrate the impact of each component of the proposed EMDC.
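The ensemble-building steps described above can be illustrated with a minimal sketch. It is not the EMDC implementation, and it rests on several assumptions: PCA (via SVD) stands in for the deep autoencoder, a plain Lloyd's k-means stands in for the basic clustering algorithm, the expression matrix is synthetic, and the bottleneck sizes (2, 5, 10) are arbitrary. The gene-filtering step and the multiobjective evolutionary pruning are omitted; the base clusterings are instead combined through a simple co-association matrix. All function names are hypothetical.

```python
import numpy as np


def pca_embed(X, dim):
    """Linear stand-in (assumption) for an autoencoder bottleneck of size `dim`."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns one integer label per row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's center unchanged
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def co_association(partitions):
    """Fraction of base clusterings that put each pair of cells together."""
    P = np.asarray(partitions)  # shape: (n_clusterings, n_cells)
    return (P[:, :, None] == P[:, None, :]).mean(axis=0)


# Synthetic "expression" matrix: 3 groups of 20 cells over 200 genes.
rng = np.random.default_rng(0)
n_per_group, n_genes, k = 20, 200, 3
group_means = rng.normal(0.0, 2.0, size=(k, n_genes))
X = np.vstack(
    [m + rng.normal(0.0, 1.0, size=(n_per_group, n_genes)) for m in group_means]
)

# One base clustering per embedding dimension (hypothetical bottleneck sizes).
bottleneck_sizes = (2, 5, 10)
partitions = [kmeans(pca_embed(X, d), k, rng) for d in bottleneck_sizes]

# Combine the ensemble: cluster the rows of the co-association matrix.
C = co_association(partitions)
consensus = kmeans(C, k, rng)
print(C.shape, consensus.shape)
```

In this sketch every base clustering contributes equally to the co-association matrix. A pruning step such as the one the abstract describes would instead select a subset of `partitions` before combining them.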