Abstract

Background: A key task in single-cell RNA-seq (scRNA-seq) data analysis is to accurately detect the number of cell types in the sample, which can be critical for downstream analyses such as cell type identification. Various scRNA-seq data clustering algorithms have been specifically designed to automatically estimate the number of cell types by optimising the number of clusters in a dataset. The lack of benchmark studies, however, complicates the choice of method.

Results: We systematically benchmark a range of popular clustering algorithms on estimating the number of cell types in a variety of settings by sampling from the Tabula Muris data to create scRNA-seq datasets with a varying number of cell types, a varying number of cells in each cell type, and different cell type proportions. The large number of datasets enables us to assess the performance of the algorithms, which cover four broad categories of approaches, from various aspects using a panel of criteria. We further cross-compare the performance on datasets with high cell numbers using Tabula Muris and Tabula Sapiens data.

Conclusions: We identify the strengths and weaknesses of each method on multiple criteria, including the deviation of the estimate from the true number of cell types, the variability of the estimate, the clustering concordance of cells with their predefined cell types, and running time and peak memory usage. We then summarise these results into a multi-aspect recommendation to users. The proposed stability-based approach for estimating the number of cell types is implemented in an R package and is freely available from https://github.com/PYangLab/scCCESS.

Highlights

  • A key task in single-cell RNA-seq data analysis is to accurately detect the number of cell types in the sample, which can be critical for downstream analyses such as cell type identification

  • Current clustering methods that estimate the number of cell types can be loosely classified into the following categories: (i) intra- and inter-cluster similarity, (ii) modularity in community detection, (iii) eigenvector-based metrics, and (iv) stability metrics

  • The number of cells is kept the same among all cell types in settings 1 and 2, whereas in setting 3 the number of cells differs between major and minor cell types. We subsampled from both the Tabula Muris and the Tabula Sapiens [35] datasets to create a fourth setting in which the datasets contain a large number of cells (2500 to 10,000)
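The balanced and imbalanced sampling settings described above can be illustrated with a minimal sketch. It assumes only a list of per-cell type annotations; the function name `sample_cells`, the toy cell type names, and the sizes used are hypothetical and are not taken from the benchmark itself.

```python
import random
from collections import defaultdict

def sample_cells(labels, sizes, seed=0):
    """Pick len(sizes) cell types at random and draw sizes[i] cells from the i-th.

    labels: list of cell type labels, one per cell (list index = cell id).
    sizes:  cells to draw per selected type; equal entries give a balanced
            dataset, unequal entries give major and minor cell types.
    Returns a sorted list of selected cell indices.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_type[lab].append(idx)
    # Only types with enough cells for the largest requested size are eligible.
    eligible = sorted(t for t, cells in by_type.items() if len(cells) >= max(sizes))
    if len(eligible) < len(sizes):
        raise ValueError("not enough cell types with the requested number of cells")
    chosen = rng.sample(eligible, len(sizes))
    picked = []
    for cell_type, n in zip(chosen, sizes):
        picked.extend(rng.sample(by_type[cell_type], n))
    return sorted(picked)

# Toy annotation (hypothetical): 4 cell types with 50 cells each.
toy_labels = [t for t in ["T cell", "B cell", "hepatocyte", "neuron"] for _ in range(50)]
balanced = sample_cells(toy_labels, sizes=[20, 20, 20])    # equal cells per type
imbalanced = sample_cells(toy_labels, sizes=[40, 40, 10])  # two major types, one minor
```

Varying the length of `sizes` changes the number of cell types, while varying its entries changes the number of cells per type or the type proportions.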

Introduction

A key task in single-cell RNA-seq (scRNA-seq) data analysis is to accurately detect the number of cell types in the sample, which can be critical for downstream analyses such as cell type identification. While much attention has been given to clustering cells into cell type groups, estimating the number of cell types in a given scRNA-seq dataset has received less attention. Estimating the number of cell types can be framed as finding the optimal number of clusters for a given scRNA-seq dataset, under the assumption that each cluster corresponds to a unique cell type in the dataset [6]. Under this assumption, current clustering methods that estimate the number of cell types can be loosely classified into the following categories: (i) intra- and inter-cluster similarity, (ii) modularity in community detection, (iii) eigenvector-based metrics, and (iv) stability metrics. Given the lack of systematic evaluation of how well clustering algorithms estimate the number of cell types, in this study we set out to systematically assess this estimation for a collection of clustering algorithms from each of these categories.
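To make category (i) concrete, the following minimal sketch chooses the number of clusters that maximises the average silhouette width, a standard intra- versus inter-cluster similarity score. It is not one of the benchmarked methods: the basic k-means, the helper names (`kmeans`, `mean_silhouette`, `estimate_k`), and the 2D toy points standing in for expression profiles are all assumptions made for illustration.

```python
import math

def kmeans(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def mean_silhouette(points, labels):
    """Average silhouette width; singleton clusters score 0 by convention."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def estimate_k(points, k_range):
    """Return the k in k_range with the highest average silhouette width."""
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in k_range}
    return max(scores, key=scores.get), scores

# Toy data (hypothetical): three well-separated groups of five 2D points.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
toy = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (0, 10)] for dx, dy in offsets]
best_k, scores = estimate_k(toy, range(2, 6))
```

On real scRNA-seq data the points would typically be cells in a reduced-dimensional space, and the chosen `k` is only as reliable as the assumption that each cluster corresponds to one cell type.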
