Abstract

Background: Single-cell RNA sequencing (scRNA-seq) technology provides an effective way to study cell heterogeneity. However, due to low capture efficiency and stochastic gene expression, scRNA-seq data often contain a high percentage of missing values. It has been shown that the missing rate can reach approximately 30% even after noise reduction. To accurately recover missing values in scRNA-seq data, we need to know where the data are missing, how much is missing, and what the missing values are.

Methods: To solve these three problems, we propose a novel model based on a hybrid machine learning method, namely, missing imputation for single-cell RNA-seq (MISC). To solve the first problem, we transformed it into a binary classification problem on the RNA-seq expression matrix. Then, for the second problem, we searched for the intersection of the classification results, the zero-inflated model results, and the false negative model results. Finally, we used a regression model to recover the data in the missing elements.

Results: We compared the raw data without imputation, the mean-smooth neighbor cell trajectory, and MISC on chronic myeloid leukemia (CML) data and on the primary somatosensory cortex and the hippocampal CA1 region of mouse brain cells. On the CML data, MISC discovered a trajectory branch from chronic phase CML (CP-CML) to blast crisis CML (BC-CML), which provides direct evidence of evolution from CP to BC stem cells. On the mouse brain data, MISC clearly divides the pyramidal CA1 cells into different branches, which is direct evidence of subpopulations within pyramidal CA1. In addition, with MISC, the oligodendrocyte cells became an independent group with an apparent boundary.

Conclusions: Our results showed that the MISC model improved cell type classification and could be instrumental in studying cellular heterogeneity. Overall, MISC is a robust missing data imputation model for single-cell RNA-seq data.
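
The three Methods steps can be illustrated with a minimal Python sketch. It is not the MISC implementation: the abstract does not name the classifier, the zero-inflated model, the false negative model, or the regression model, so the three candidate masks are taken as given inputs, and the per-gene straight-line fit against a per-cell average is a stand-in chosen only for brevity. All function and variable names here are hypothetical.

```python
# Illustrative sketch only; names and modelling choices are assumptions,
# not the authors' method.

def intersect_masks(*masks):
    """Flag an entry as missing only if every candidate mask flags it."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[all(m[i][j] for m in masks) for j in range(cols)]
            for i in range(rows)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0  # no spread in x: fall back to the mean of y
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def impute(expr, missing):
    """Fill flagged entries of a genes-by-cells matrix.

    Assumed stand-in regression: each gene is regressed on a per-cell
    covariate, the mean of that cell's non-flagged entries.
    """
    n_genes, n_cells = len(expr), len(expr[0])
    cov = []
    for j in range(n_cells):
        vals = [expr[i][j] for i in range(n_genes) if not missing[i][j]]
        cov.append(sum(vals) / len(vals) if vals else 0.0)
    out = [row[:] for row in expr]
    for i in range(n_genes):
        obs = [j for j in range(n_cells) if not missing[i][j]]
        if len(obs) < 2:
            continue  # too few observed cells to fit a line
        a, b = fit_line([cov[j] for j in obs], [expr[i][j] for j in obs])
        for j in range(n_cells):
            if missing[i][j]:
                out[i][j] = max(0.0, a + b * cov[j])  # expression is non-negative
    return out

# Toy data: 2 genes x 4 cells, with a suspicious zero at gene 1, cell 3.
expr = [[1.0, 2.0, 3.0, 4.0],
        [2.0, 4.0, 6.0, 0.0]]
F, T = False, True
classifier_mask     = [[F, F, F, F], [F, F, F, T]]  # step 1 output (assumed given)
zero_inflated_mask  = [[F, T, F, F], [F, F, F, T]]  # assumed given
false_negative_mask = [[F, F, F, F], [T, F, F, T]]  # assumed given

missing = intersect_masks(classifier_mask, zero_inflated_mask, false_negative_mask)
imputed = impute(expr, missing)
```

Taking the intersection means an entry is imputed only when all three sources agree that it is missing, which keeps observed values (including true zeros flagged by only one model) untouched.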

Highlights

  • Single-cell RNA sequencing technology provides an effective way to study cell heterogeneity

  • Single-cell RNA-seq expression profiling offers a static snapshot of gene expression and supports the estimation of cell heterogeneity and the detection of rare cell types

  • t-SNE on data imputed by missing imputation for single-cell RNA-seq (MISC) shows the evolution from CP to blast crisis (BC) stem cells, as in our trajectory analysis, and presents more compact clusters

Summary

Introduction

Single-cell RNA sequencing (scRNA-seq) technology provides an effective way to study cell heterogeneity. Due to low capture efficiency and stochastic gene expression, scRNA-seq data often contain a high percentage of missing values. Advances in single-cell genomics have provided unprecedented opportunities in biomedical research, where it is important to identify different cell types pertinent to aging and cellular malignancy. Although scRNA-seq data analysis provides an opportunity to study the heterogeneity of cells and the genes that are differentially expressed across biological conditions, performing the analysis is challenging. With the rapid growth of scRNA-seq data, computational methods need to overcome challenges ranging from handling technical noise, to constructing and characterizing cell identities, to analyzing cell lineages by computing on high-dimensional sparse matrices. Innovative, efficient, robust, and scalable computational analysis methods are essential to this new frontier.
