Contrastive self-supervised clustering of scRNA-seq data

Madalina Ciortan, Matthieu Defrance

doi:10.1186/s12859-021-04210-8

Abstract

Background: Single-cell RNA sequencing (scRNA-seq) has emerged as a main strategy to study transcriptional activity at the cellular level. Clustering analysis is routinely performed on scRNA-seq data to explore, recognize or discover underlying cell identities. The high dimensionality of scRNA-seq data and its significant sparsity, accentuated by frequent dropout events that introduce false zero-count observations, make clustering analysis computationally challenging. Even though multiple scRNA-seq clustering techniques have been proposed, there is no consensus on the best-performing approach. On a parallel research track, self-supervised contrastive learning recently achieved state-of-the-art results on image clustering and, subsequently, image classification.

Results: We propose contrastive-sc, a new unsupervised learning method for scRNA-seq data that performs cell clustering. The method consists of two consecutive phases: first, an artificial neural network learns an embedding for each cell through a representation training phase. The embedding is then clustered in the second phase with a general clustering algorithm (i.e. KMeans or Leiden community detection). The proposed representation training phase is a new adaptation of the self-supervised contrastive learning framework, initially proposed for image processing, to scRNA-seq data. contrastive-sc has been compared with ten state-of-the-art techniques. A broad experimental study has been conducted on both simulated and real-world datasets, assessing multiple external and internal clustering performance metrics (i.e. ARI, NMI, Silhouette, Calinski scores). Our experimental analysis shows that contrastive-sc compares favorably with state-of-the-art methods on both simulated and real-world datasets.

Conclusion: On average, our method identifies well-defined clusters in close agreement with ground truth annotations. Our method is computationally efficient, being fast to train and having a limited memory footprint. contrastive-sc maintains good performance when only a fraction of input cells is provided and is robust to changes in hyperparameters or network architecture. The decoupling between the creation of the embedding and the clustering phase allows the flexibility to choose a suitable clustering algorithm (i.e. KMeans when the number of expected clusters is known, Leiden otherwise) or to integrate the embedding with other existing techniques.
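The second phase and the evaluation described in the abstract can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the embedding is a synthetic stand-in generated with `make_blobs` (the contrastive network itself is not reproduced here), and the sizes, seed and number of clusters are assumptions chosen only for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    normalized_mutual_info_score,
    silhouette_score,
)

# Assumption: a synthetic stand-in for the learned cell embedding.
# 300 "cells" in a 32-dimensional space, drawn from 4 groups.
embedding, true_labels = make_blobs(
    n_samples=300, n_features=32, centers=4, cluster_std=1.0, random_state=0
)

# Second phase: cluster the embedding with a general-purpose algorithm.
# KMeans needs the expected number of clusters up front.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
pred_labels = kmeans.fit_predict(embedding)

scores = {
    # External metrics: compare predictions with ground truth annotations.
    "ARI": adjusted_rand_score(true_labels, pred_labels),
    "NMI": normalized_mutual_info_score(true_labels, pred_labels),
    # Internal metrics: use only the embedding and the predicted labels.
    "Silhouette": silhouette_score(embedding, pred_labels),
    "Calinski-Harabasz": calinski_harabasz_score(embedding, pred_labels),
}

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because the clustering step only consumes an array of embeddings, another algorithm (for instance a graph-based community detection method such as Leiden) could be swapped in when the number of clusters is not known in advance.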

Highlights

Single-cell RNA sequencing has emerged as a main strategy to study transcriptional activity at the cellular level
An artificial neural network is trained to produce a representation for each cell, which is clustered in a second phase with a general clustering algorithm
This distinction has been made because some of the existing libraries require the number of clusters to be identified as input, while others can dynamically infer it from various data density or connectivity criteria

Summary

Introduction

Single-cell RNA sequencing (scRNA-seq) has emerged as a main strategy to study transcriptional activity at the cellular level. The high dimensionality of scRNA-seq data and its significant sparsity, accentuated by frequent dropout events that introduce false zero-count observations, make clustering analysis computationally challenging. In the absence of cell type annotations, unsupervised clustering models are typically employed to identify or discover cellular subtypes in scRNA-seq data. Despite the extensive study of clustering models in machine learning [2, 3], single-cell transcriptomic clustering remains challenging due to the high dimensionality of the data (the number of transcripts is usually greater than 20,000, leading to “the curse of dimensionality”) and the high sparsity caused by low mRNA expression levels and dropout events. Numerous clustering methods have emerged, proposing diverse solutions to the technical challenges raised by scRNA-seq data analysis, as shown in review papers [4,5,6,7]. Several scRNA-seq analysis methods, including Seurat, have been made available in the Python package scanpy [14].
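To make the effect of dropout events concrete, the following toy simulation shows how randomly zeroing observed counts raises the sparsity of a count matrix. It is only an illustrative sketch: the Poisson count model, the matrix size and the 30% dropout rate are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: a small hypothetical "true" expression matrix (cells x genes).
n_cells, n_genes = 200, 1000
true_counts = rng.poisson(lam=2.0, size=(n_cells, n_genes))

# Assumption: each entry is independently lost with a fixed probability.
dropout_rate = 0.3
keep_mask = rng.random(true_counts.shape) >= dropout_rate
observed_counts = true_counts * keep_mask


def sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    return float(np.mean(matrix == 0))


true_sparsity = sparsity(true_counts)
observed_sparsity = sparsity(observed_counts)

# False zeros: entries that were expressed but are observed as zero.
false_zeros = int(np.sum((true_counts > 0) & (observed_counts == 0)))

print(f"sparsity before dropout: {true_sparsity:.3f}")
print(f"sparsity after dropout:  {observed_sparsity:.3f}")
print(f"false zero observations: {false_zeros}")
```

In real data the dropout probability is generally not uniform across genes or cells, so this uniform mask should be read as the simplest possible stand-in.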



Journal: BMC Bioinformatics
Publication Date: May 27, 2021
Citations: 42
License type: open-access

Similar Papers

Deep Denoising Sparse Coding
Yijie Wang ... Bo Yang
01 Nov 2020

CaSS: A Channel-Aware Self-supervised Representation Learning Framework for Multivariate Time Series Classification
Yijiang Chen ... Zhen Xing
01 Jan 2021

Clustering single-cell RNA-seq data with a model-based deep learning approach
Tian Tian ... Qi Song
Nature Machine Intelligence | VOL. 1
01 Apr 2019

ScDSSC: Deep Sparse Subspace Clustering for scRNA-seq Data.
Haiyun Wang ... Jianping Zhao
PLOS Computational Biology | VOL. 18
19 Dec 2022
