A copula based topology preserving graph convolution network for clustering of single-cell RNA-seq data.

Snehalika Lall, Sanghamitra Bandyopadhyay, Sumanta Ray, Quan Zou

doi:10.1371/journal.pcbi.1009600


Open Access

https://doi.org/10.1371/journal.pcbi.1009600


Abstract

Annotation of cells in single-cell clustering requires a homogeneous grouping of cell populations. Several issues in single-cell sequencing affect the homogeneous grouping (clustering) of cells, such as the small amount of starting RNA, limited per-cell sequenced reads, and cell-to-cell variability due to cell cycle, cellular morphology, and variable reagent concentrations. Moreover, single-cell data are susceptible to technical noise, which affects the quality of the genes (or features) selected or extracted prior to clustering. Here we introduce sc-CGconv (copula based graph convolution network for single-cell clustering), a stepwise, robust, unsupervised feature extraction and clustering approach that formulates and aggregates cell-cell relationships using copula correlation (Ccor), followed by a graph convolution network based clustering approach. sc-CGconv formulates a cell-cell graph using Ccor, and this graph is learned by a graph-based artificial intelligence model, a graph convolution network. The learned representation (a low-dimensional embedding) is used for cell clustering. sc-CGconv offers the following advantages: (a) it works with substantially smaller sample sizes to identify homogeneous clusters; (b) it can model the expression co-variability of a large number of genes, thereby outperforming state-of-the-art gene selection/extraction methods for clustering; (c) it preserves the cell-to-cell variability within the selected gene set by constructing a cell-cell graph through the copula correlation measure; and (d) it provides a topology-preserving embedding of cells in low-dimensional space.
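The graph-then-convolution idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Ccor measure is not defined in this text, so a Spearman-style rank correlation on empirical-copula pseudo-observations is swapped in as a stand-in, the k-nearest-neighbour graph construction is an assumption, and the single graph convolution layer uses random, untrained weights. All function names and the toy data are hypothetical.

```python
import numpy as np

def pseudo_observations(x):
    """Rank-transform each row to the empirical-copula scale (0, 1).
    Ties are broken by position, which is a simplification."""
    ranks = x.argsort(axis=1).argsort(axis=1) + 1
    return ranks / (x.shape[1] + 1)

def rank_correlation(expr):
    """Cell-by-cell rank correlation (assumed stand-in for Ccor)."""
    return np.corrcoef(pseudo_observations(expr))

def knn_adjacency(sim, k):
    """Symmetric k-nearest-neighbour graph built from a similarity matrix."""
    n = sim.shape[0]
    s = sim.copy()
    np.fill_diagonal(s, -np.inf)            # exclude self-matches
    nbrs = np.argsort(-s, axis=1)[:, :k]    # k most similar cells per row
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(adj, adj.T)           # keep an edge if either end chose it

def gcn_layer(adj, feats, weight):
    """One graph convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weight, 0.0)

# Toy data (hypothetical): 30 cells x 50 genes of Poisson counts.
rng = np.random.default_rng(0)
expr = rng.poisson(2.0, size=(30, 50)).astype(float)

sim = rank_correlation(expr)
adj = knn_adjacency(sim, k=5)
weight = rng.normal(size=(50, 8))           # untrained weights, illustration only
embedding = gcn_layer(adj, np.log1p(expr), weight)
```

In a trained model the weights would be learned, and the resulting low-dimensional embedding would then be passed to a clustering step.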

Highlights

Recent developments in single-cell RNA-seq (scRNA-seq) technology have made it possible to generate huge volumes of data, allowing researchers to measure and quantify RNA levels on a large scale [1].
sc-CGconv is a clustering framework that leverages the landmark advantages of copula and graph convolution networks in the single-cell analysis domain. It outperforms state-of-the-art feature selection/extraction methods in the preprocessing steps, performs well on small-sample-size data, preserves the cell-to-cell variability within the extracted features, and provides a topology-preserving embedding of cells in low-dimensional space. sc-CGconv addresses the above-mentioned key challenges.
We demonstrate in experiments that (i) sc-CGconv leads to a pure clustering of cells in scRNA-seq data, (ii) the annotation of cells is accurate for unknown test samples, (iii) the marker genes identified in the annotation step clearly separate the cell types in the scRNA-seq data, and (iv) sc-CGconv can handle substantially larger data with high accuracy.

Summary

Introduction

Recent developments in single-cell RNA-seq (scRNA-seq) technology have made it possible to generate huge volumes of data, allowing researchers to measure and quantify RNA levels on a large scale [1]. This has led to a greater understanding of the heterogeneity of cell populations, disease states, cell types, developmental lineages, and much more. The standard pipeline for downstream analysis of scRNA-seq data starts from the processing of the raw count matrix and goes through the following steps [8, 9]: (i) normalization (and quality control) of the raw count matrix, (ii) gene selection and cell filtering, (iii) dimensionality reduction, (iv) unsupervised clustering of cells into groups (or clusters), and (v) annotation of cells by assigning labels to each cluster. A good clustering (or classification) of cell samples can be ensured by the following characteristics of the features obtained from step (iii): they should contain information about the biology of the system, should not contain random noise, and should preserve the structure of the data while reducing its size as much as possible.
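Steps (i) to (iv) of the standard pipeline can be sketched in a few lines of NumPy. This is a generic illustration under stated assumptions, not the method proposed here: library-size normalization with a log transform, variance-based gene selection, PCA via SVD, and a plain k-means are common choices picked for brevity, and every function name, parameter value, and the synthetic count matrix are hypothetical. Step (v), annotation, is omitted because it needs reference labels or marker genes.

```python
import numpy as np

def normalize(counts, target=1e4):
    """(i) Scale each cell to a common library size, then log-transform."""
    lib = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / lib * target)

def top_variable_genes(x, n_genes):
    """(ii) Indices of the n_genes columns with the highest variance."""
    return np.argsort(x.var(axis=0))[::-1][:n_genes]

def pca(x, n_components):
    """(iii) Project cells onto the leading principal components via SVD."""
    centered = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def kmeans(x, k, n_iter=50, seed=0):
    """(iv) Plain Lloyd's k-means; returns one cluster label per cell."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # leave empty clusters unchanged
                centers[j] = x[labels == j].mean(axis=0)
    return labels

# Synthetic counts (hypothetical): 40 cells x 100 genes, two groups that
# differ in the expression rate of ten genes each.
rng = np.random.default_rng(0)
rates = np.ones((40, 100))
rates[:20, :10] = 8.0
rates[20:, 10:20] = 8.0
counts = rng.poisson(rates).astype(float)

counts = counts[counts.sum(axis=1) > 0]      # (ii) drop empty cells
norm = normalize(counts)
genes = top_variable_genes(norm, n_genes=20)
embedding = pca(norm[:, genes], n_components=5)
labels = kmeans(embedding, k=2)
```

Each step only sees the output of the previous one, which is why the quality of the features produced in step (iii) bounds what the clustering in step (iv) can recover.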

Methods

Results

Conclusion