Abstract

Single-cell RNA sequencing (scRNA-seq) is emerging as a promising technology. There exist a huge number of genes in a scRNA-seq data. However, some genes are high quality genes, and some are noises and irrelevant genes because of unspecific technology reasons. These noises and irrelevant genes may have a strong influence on downstream data analyses, such as a cell classification, gene function analysis, and cancer biomarker detection. Therefore, it is very significant to obviate these irrelevant genes and choose high quality genes by gene selection methods. In this study, a novel gene selection and classification method is presented by combining the information gain ratio and the genetic algorithm with dynamic crossover (abbreviated as IGRDCGA). The information gain ratio (IGR) is employed to eliminate irrelevant genes roughly and obtain a preliminary gene subset, and then the genetic algorithm with a dynamic crossover (DCGA) is utilized to choose high quality genes finely from the preliminary gene subset. The main difference between the IGRDCGA and the existing methods is that the DCGA and IGR are integrated first and used to select genes from scRNA-seq data. We conduct the IGRDCGA and several competing methods on some real-world scRNA-seq datasets. The obtained results demonstrate that the IGRDCGA can choose high quality genes effectively and efficiently and outperforms the other several competing methods in terms of both the dimensionality reduction and the classification accuracy.

Highlights

  • In scRNA-seq data, there often are amounts of genes and may reach tens of thousands

  • We present a novel algorithm to address the gene selection and classification for scRNA-seq data by combining information gain ratio and genetic algorithm with dynamic crossover (IGRDCGA for short). e coding and the other details of the IGRDCGA are as follows

  • In order to evaluate the performances of the IGRDCGA, two frequently used clustering algorithms, k means and spectral clustering [57], a state-of-the-art single-cell classification algorithm SIMLR [58], are employed to compare it

Read more

Summary

Introduction

In scRNA-seq data, there often are amounts of genes and may reach tens of thousands. Some genes are irrelevant or unsuitable for classification tasks, and they may seriously affect the efficiency of downstream data analysis. In order to obviate these irrelevant genes and select high quality genes, an effective and efficient gene selection algorithm is vital. Eroglu and Kilic [3] integrated a genetic local search algorithm and a k-nearest neighbor classifier to select feature subset. How to correctly use the GA to address the gene selection and classification of scRNA-seq data is a significant issue to consider first. E study integrates the IGR and DCGA to address the gene selection and classification of scRNA-seq data and proposes a novel gene selection and classification algorithm. E IGRDCGA utilizes the IGR to eliminate irrelevant genes roughly and obtain a preliminary gene subset and employs the DCGA to choose high quality genes finely from the preliminary gene subset.

Related Work
Evaluation Metrics
Dataset and Preprocessing
The Proposed Algorithm
Numerical Results
Results
Conclusion and Future Work
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call