Abstract
BackgroundSingle-cell RNA sequencing (scRNA-seq) has emerged has a main strategy to study transcriptional activity at the cellular level. Clustering analysis is routinely performed on scRNA-seq data to explore, recognize or discover underlying cell identities. The high dimensionality of scRNA-seq data and its significant sparsity accentuated by frequent dropout events, introducing false zero count observations, make the clustering analysis computationally challenging. Even though multiple scRNA-seq clustering techniques have been proposed, there is no consensus on the best performing approach. On a parallel research track, self-supervised contrastive learning recently achieved state-of-the-art results on images clustering and, subsequently, image classification.ResultsWe propose contrastive-sc, a new unsupervised learning method for scRNA-seq data that perform cell clustering. The method consists of two consecutive phases: first, an artificial neural network learns an embedding for each cell through a representation training phase. The embedding is then clustered in the second phase with a general clustering algorithm (i.e. KMeans or Leiden community detection). The proposed representation training phase is a new adaptation of the self-supervised contrastive learning framework, initially proposed for image processing, to scRNA-seq data. contrastive-sc has been compared with ten state-of-the-art techniques. A broad experimental study has been conducted on both simulated and real-world datasets, assessing multiple external and internal clustering performance metrics (i.e. ARI, NMI, Silhouette, Calinski scores). Our experimental analysis shows that constastive-sc compares favorably with state-of-the-art methods on both simulated and real-world datasets.ConclusionOn average, our method identifies well-defined clusters in close agreement with ground truth annotations. Our method is computationally efficient, being fast to train and having a limited memory footprint. contrastive-sc maintains good performance when only a fraction of input cells is provided and is robust to changes in hyperparameters or network architecture. The decoupling between the creation of the embedding and the clustering phase allows the flexibility to choose a suitable clustering algorithm (i.e. KMeans when the number of expected clusters is known, Leiden otherwise) or to integrate the embedding with other existing techniques.
Highlights
Single-cell RNA sequencing has emerged has a main strategy to study transcriptional activity at the cellular level
An artificial neural network is trained to produce representations for each cell which is clustered in a second phase with a general clustering algorithm
This distinction has been made because some of the existing libraries require to input the number of clusters to be identified while others can dynamically infer it from various data density or connectivity criteria
Summary
Single-cell RNA sequencing (scRNA-seq) has emerged has a main strategy to study transcriptional activity at the cellular level. The high dimensionality of scRNA-seq data and its significant sparsity accentuated by frequent dropout events, introducing false zero count observations, make the clustering analysis computationally challenging. In the absence of cell type annotations, unsupervised clustering models are typically employed to identify or discover cellular subtypes in scRNA-seq data. Despite the extensive study of clustering models in machine learning [2, 3], single-cell transcriptomic clustering remains challenging due to the high dimensionality of data (the number of transcripts is usually greater than 20,000, leading to “the curse of dimensionality”), the high sparsity due to low mRNA expression level and dropout events. Numerous clustering methods emerged to propose diverse solutions to the technical challenges raised by scRNA-seq data analysis, as shown in review papers [4,5,6,7]. Several scRNA-seq analysis methods, including Seurat, have been made available in the python package scanpy [14]
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.