Network-Based Single-Cell RNA-Seq Data Imputation Enhances Cell Type Identification.

Maryam Zand,Jianhua Ruan

doi:10.3390/genes11040377

Abstract

Single-cell RNA sequencing (scRNA-seq) is a powerful technology for obtaining transcriptomes at single-cell resolution. However, it suffers from dropout events (i.e., excess zero counts), since only a small fraction of the transcripts in each cell are sequenced. This inherent sparsity of expression profiles hinders further characterization at the cell and gene level, such as cell type identification and downstream analysis. To alleviate the dropout issue, we introduce a network-based method, netImpute, which leverages the hidden information in gene co-expression networks to recover real signals. netImpute employs Random Walk with Restart (RWR) to adjust the expression level of a gene in a given cell by borrowing information from its neighbors in a gene co-expression network. Performance evaluation and comparison with existing tools on simulated data and seven real datasets show that netImpute substantially enhances clustering accuracy and data visualization clarity, thanks to its effective treatment of dropouts. While the idea of netImpute is general and can be applied with other types of networks, such as a cell co-expression network or a protein–protein interaction (PPI) network, evaluation results show that the gene co-expression network is consistently more beneficial, presumably because a PPI network usually lacks cell type context, while a cell co-expression network can cause information loss for rare cell types. Evaluation results on several biological datasets show that netImpute recovers missing transcripts in scRNA-seq data, and enhances the identification and visualization of heterogeneous cell types, more effectively than existing methods.
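To make the RWR idea above concrete, the following is a minimal sketch in Python, not the netImpute implementation itself. The abstract does not specify how the network is built or normalized, so everything here is an assumption chosen for illustration: the function name `rwr_impute`, the Pearson-correlation threshold `corr_threshold`, the restart probability `restart`, the fixed iteration count `n_iter`, and the column normalization of the adjacency matrix.

```python
import numpy as np

def rwr_impute(expr, restart=0.5, n_iter=50, corr_threshold=0.3):
    """Illustrative RWR smoothing of a genes x cells expression matrix.

    All parameter names and default values are hypothetical; they are
    not taken from the netImpute paper.
    """
    p0 = np.asarray(expr, dtype=float)

    # Gene co-expression network (assumption): keep positive Pearson
    # correlations between gene rows that reach the threshold.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.nan_to_num(np.corrcoef(p0))
    adj = np.where(corr >= corr_threshold, corr, 0.0)
    np.fill_diagonal(adj, 0.0)

    # Genes without neighbors get a self-loop so their values stay unchanged.
    isolated = np.flatnonzero(adj.sum(axis=0) == 0)
    adj[isolated, isolated] = 1.0

    # Column-normalize so each column of W sums to 1.
    W = adj / adj.sum(axis=0)

    # Random walk with restart: P <- (1 - r) * W @ P + r * P0.
    p = p0.copy()
    for _ in range(n_iter):
        p = (1.0 - restart) * (W @ p) + restart * p0
    return p

# Toy example: gene 1 has a zero (a dropout) in cell 2, gene 2 is constant.
toy = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 4.0, 0.0, 8.0, 10.0, 12.0],
    [3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
])
imputed = rwr_impute(toy)
```

In this sketch each cell (column) is smoothed independently over the gene network, so a zero in one gene is filled in from co-expressed genes in the same cell. Because the transition matrix is column-stochastic, the total expression per cell is preserved; whether netImpute has this property is not stated in the abstract.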

Highlights

In the past decade, advancements in next-generation sequencing technologies have revealed unprecedented insights into complex biological systems.
As the first step to assess the performance of our method, we design a simulation study in which we generate gene expression data for 150 cells divided into three cell types, with a total of
We systematically increase the proportion of zeros from 72% to 87% to mimic different levels of complexity arising from excess zeros.
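The simulation design in the highlights above can be sketched roughly as follows. This is an illustrative assumption, not the authors' simulation procedure: the gene count `n_genes`, the equal split of cells across types, the gamma/Poisson expression model, and the uniformly random choice of entries to zero out are all hypothetical, as are the function names `simulate_counts` and `add_dropouts`.

```python
import numpy as np

def simulate_counts(n_cells=150, n_types=3, n_genes=500, seed=0):
    """Toy genes x cells count matrix with n_types cell types.

    n_genes and the gamma/Poisson model are placeholders, not values
    from the paper.
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(n_types), n_cells // n_types)
    # Each cell type gets its own mean expression level per gene.
    type_means = rng.gamma(shape=2.0, scale=2.0, size=(n_types, n_genes))
    counts = rng.poisson(type_means[labels].T)
    return counts, labels

def add_dropouts(counts, target_zero_frac, seed=0):
    """Zero out random nonzero entries until the target zero fraction is reached."""
    rng = np.random.default_rng(seed)
    out = counts.copy()
    nonzero_idx = np.flatnonzero(out)
    n_zero_now = out.size - nonzero_idx.size
    n_to_zero = int(round(target_zero_frac * out.size)) - n_zero_now
    if n_to_zero > 0:
        drop = rng.choice(nonzero_idx, size=n_to_zero, replace=False)
        out.flat[drop] = 0
    return out

counts, labels = simulate_counts()
# Sweep the zero proportion over the range mentioned in the text.
sparse_versions = {
    frac: add_dropouts(counts, frac) for frac in (0.72, 0.77, 0.82, 0.87)
}
```

Each entry of `sparse_versions` is a progressively sparser copy of the same underlying data, which allows an imputation method to be compared across dropout levels against known cell type labels.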

Summary

Introduction

Advancements in next-generation sequencing technologies have revealed unprecedented insights into complex biological systems. These technologies have increasingly been developed to examine a diverse range of phenomena at single-cell resolution. Each of the recently emerging single-cell transcriptomic technologies has its own strengths, accompanied by weaknesses and restrictions in accuracy, sensitivity, throughput, and precision. They suffer from technical noise, mainly rooted in the low amount of starting mRNA in each cell. As a result, as little as 5–15% of the transcriptome is captured during the amplification process, leading to a partially observed version of the actual expression profile [7,8,9].

Methods

Results

Conclusion