Abstract

In recent years, single-cell RNA-seq (scRNA-seq) has become increasingly popular in biological and medical research. Analysis of scRNA-seq data can reveal complex cell populations and infer single-cell trajectories during cell development. Clustering is one of the most important methods for analyzing scRNA-seq data. In this paper, we focus on improving scRNA-seq clustering through gene selection. Studies have shown that gene selection can improve clustering accuracy, so it is important to select genes with cell type specificity. Gene selection not only reduces the dimensionality of scRNA-seq data but, combined with clustering methods, can also improve cell type identification. Here, we propose RFCell, a supervised gene selection method based on permutation and random forest classification. We first use RFCell and three existing gene selection methods to select gene sets on 10 scRNA-seq data sets. Then, three classical clustering algorithms are used to cluster the cells based on the gene sets obtained by these methods. We found that RFCell performed better than the other gene selection methods.

Highlights

  • Single-cell RNA-seq (scRNA-seq) provides unprecedented insight into biological questions at the level of individual cells (Hwang et al., 2018)

  • We propose RFCell, a gene selection strategy based on permutation and random forest. It uses supervised classification, a pattern recognition technique, to determine the best subset of genes for cell type recognition without referring to any known transcriptome profile or cell-related information

  • SC3 is a user-friendly tool for unsupervised clustering whose pipeline includes gene filtering, similarity calculation, transformations, k-means, consensus clustering, and hierarchical clustering of the consensus results
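The summary does not spell out how RFCell combines permutation with random forest classification. One plausible reading, offered here as an assumption and not as the authors' implementation, is to train a forest to separate real cells from a copy of the data in which each gene is permuted independently across cells, and then rank genes by their importance to that classifier. No cell type labels are needed, because the two classes are "real" and "permuted". A minimal Python sketch under that assumption, using scikit-learn (the function name, parameters, and defaults are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def select_genes_real_vs_permuted(expr, n_genes, n_trees=200, seed=0):
    """Rank genes by how much they help a random forest tell real cells
    from permuted ones. Hypothetical helper, not the published RFCell code.

    expr: array of shape (n_cells, n_genes_total), e.g. log-normalized counts.
    Returns the column indices of the n_genes top-ranked genes.
    """
    rng = np.random.default_rng(seed)
    expr = np.asarray(expr, dtype=float)

    # Permute each gene (column) independently across cells. Per-gene
    # distributions are preserved, but gene-gene dependencies are destroyed.
    permuted = np.column_stack(
        [rng.permutation(expr[:, j]) for j in range(expr.shape[1])]
    )

    # Two-class problem: 1 = real cell, 0 = permuted cell.
    X = np.vstack([expr, permuted])
    y = np.concatenate(
        [np.ones(len(expr), dtype=int), np.zeros(len(permuted), dtype=int)]
    )

    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)

    # Genes whose joint behaviour with other genes distinguishes real from
    # permuted cells receive high impurity-based importance.
    ranking = np.argsort(-forest.feature_importances_, kind="stable")
    return ranking[:n_genes]
```

The selected columns can then be passed to any clustering method, for example `expr[:, select_genes_real_vs_permuted(expr, n_genes=500)]`. The value 500 is only illustrative.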


Summary

INTRODUCTION

Single-cell RNA-seq (scRNA-seq) provides unprecedented insight into biological questions at the level of individual cells (Hwang et al., 2018). Before downstream analysis, researchers usually apply feature selection methods to extract informative genes from scRNA-seq data. By generating a two-dimensional embedding of high-dimensional data, t-distributed stochastic neighbor embedding (t-SNE) (Linderman and Steinerberger, 2019) is an effective nonlinear dimensionality reduction technique that has attracted growing scientific attention and is widely used in scRNA-seq research. Wang et al. (2019) proposed SCMarker, a marker selection strategy that delineates cell types in scRNA-seq data by identifying genes that have bimodally or multimodally distributed expression levels and are co-expressed or mutually exclusively expressed with some other genes. Expr is a gene selection method for scRNA-seq data that retains only the genes with the highest average expression (log-normalized count) across all cells. After applying RFCell for gene selection on 10 scRNA-seq data sets, we found that its average accuracy was higher than that of conventional gene selection strategies.
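The Expr baseline described above is simple enough to sketch directly. The following Python example is a minimal illustration, not the reference implementation: the function name, the library-size normalization, and the scale factor of 10,000 are assumptions, since the text only states that genes are ranked by their average log-normalized count.

```python
import numpy as np


def select_top_expressed_genes(counts, n_genes, scale=1e4):
    """Keep the genes with the highest mean log-normalized expression.

    counts: raw count matrix of shape (n_cells, n_genes_total).
    scale:  assumed per-cell scaling factor applied before the log transform.
    Returns the sorted column indices of the retained genes.
    """
    counts = np.asarray(counts, dtype=float)

    # Normalize each cell by its total count (guarding against empty cells),
    # then apply log(1 + x).
    library_size = counts.sum(axis=1, keepdims=True)
    library_size[library_size == 0] = 1.0
    log_norm = np.log1p(counts / library_size * scale)

    # Average over cells and keep the n_genes highest-ranked genes.
    mean_expr = log_norm.mean(axis=0)
    order = np.argsort(-mean_expr, kind="stable")
    return np.sort(order[:n_genes])
```

Because this ranking ignores how expression varies between cells, a gene that is uniformly high everywhere is kept even though it carries no cell type information, which is one reason to compare it against selection strategies such as RFCell.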

Method
RESULTS
DISCUSSION AND CONCLUSION
DATA AVAILABILITY STATEMENT