A fast and efficient count-based matrix factorization method for detecting cell types from single-cell RNAseq data

Shiquan Sun,Yang Liu,Yabo Chen,Xuequn Shang

doi:10.1186/s12918-019-0699-6

Shiquan Sun, Yang Liu + Show 2 more

Open Access

PDF Available

https://doi.org/10.1186/s12918-019-0699-6

Copy DOI

Export

Save

Cite

Abstract
Highlights/Summary
Full-Text PDF
Similar Papers

Abstract

Listen

BackgroundSingle-cell RNA sequencing (scRNAseq) data always involves various unwanted variables, which would be able to mask the true signal to identify cell-types. More efficient way of dealing with this issue is to extract low dimension information from high dimensional gene expression data to represent cell-type structure. In the past two years, several powerful matrix factorization tools were developed for scRNAseq data, such as NMF, ZIFA, pCMF and ZINB-WaVE. But the existing approaches either are unable to directly model the raw count of scRNAseq data or are really time-consuming when handling a large number of cells (e.g. n>500).ResultsIn this paper, we developed a fast and efficient count-based matrix factorization method (single-cell negative binomial matrix factorization, scNBMF) based on the TensorFlow framework to infer the low dimensional structure of cell types. To make our method scalable, we conducted a series of experiments on three public scRNAseq data sets, brain, embryonic stem, and pancreatic islet. The experimental results show that scNBMF is more powerful to detect cell types and 10 - 100 folds faster than the scRNAseq bespoke tools.ConclusionsIn this paper, we proposed a fast and efficient count-based matrix factorization method, scNBMF, which is more powerful for detecting cell type purposes. A series of experiments were performed on three public scRNAseq data sets. The results show that scNBMF is a more powerful tool in large-scale scRNAseq data analysis. scNBMF was implemented in R and Python, and the source code are freely available at https://github.com/sqsun.

Highlights

Single-cell RNA sequencing data always involves various unwanted variables, which would be able to mask the true signal to identify cell-types
Drop-seq [15], etc, do not have enough power to quantify the actual concentration of mRNAs [16]; the heavy amplifications may result into strong amplification bias [17]; cell cycle state, cell size or other unknown factors may contribute to cell-cell heterogeneity even within the same cell type [18]
Several dimensionality reduction techniques have been already applied to scRNAseq data analysis, such as principal component analysis (PCA) [24]; independent components analysis (ICA) [25], and diffusion map [26]; partial least squares (PLS) [27, 28]; nonnegative matrix factorization [29, 30], gene expression levels are inherently quantified by counts, i.e., count nature of scRNAseq data [31, 32]

Summary

Introduction

Single-cell RNA sequencing (scRNAseq) data always involves various unwanted variables, which would be able to mask the true signal to identify cell-types. The scRNAseq data set has their own characterizes, such as gene expression matrix is extremely sparse because of the quite small number of mRNAs represented in each cell [13]; current sequencing technologies, e.g. CEL-Seq2 [14] and. To account for the count nature of the RNA sequencing data, and the resulting mean-variance dependence, most statistical methods were developed using discrete distributions in differential expression analysis, i.e., PQLseq [20], edgeR/DESeq [21, 22], and MACAU [23]. A nature choice of analyzing scRNAseq data is to develop count-based dimensionality reduction methods. Several dimensionality reduction techniques have been already applied to scRNAseq data analysis, such as principal component analysis (PCA) [24]; independent components analysis (ICA) [25], and diffusion map [26]; partial least squares (PLS) [27, 28]; nonnegative matrix factorization (or factor analysis) [29, 30], gene expression levels are inherently quantified by counts, i.e., count nature of scRNAseq data [31, 32]

Methods

Results

Conclusion