Abstract

The microarray cancer data obtained by DNA microarray technology play an important role in cancer prevention, diagnosis, and treatment. However, predicting the different types of tumors is challenging because the sample size in microarray data is often small while the dimensionality is very high. Gene selection is an effective means of mitigating the curse of dimensionality and can boost classification accuracy on microarray data. However, many previous gene selection methods focus on model design but neglect the correlation between different genes. In this paper, we introduce a novel unsupervised gene selection method that takes gene correlation into consideration, named gene correlation guided gene selection (G3CS). Specifically, we calculate the covariance of different gene dimension pairs and embed it into our unsupervised gene selection model to regularize the gene selection coefficient matrix. In this manner, redundant genes can be effectively excluded. In addition, we utilize a matrix factorization term to exploit the cluster structure of the original microarray data to assist the learning process. We design an iterative updating algorithm with a convergence guarantee to solve the resulting optimization problem. Experiments on six publicly available microarray datasets are conducted to validate the efficacy of our proposed method.
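To make the covariance idea concrete, the following is a minimal sketch, not the G3CS model itself. It computes the sample covariance between every pair of gene columns and then evaluates a hypothetical redundancy penalty that grows when two strongly covarying genes both receive large selection weights. The function names, the penalty form, and the toy data are assumptions made for illustration; the regularizer actually used by G3CS is not given in this section.

```python
from typing import List, Sequence


def gene_covariance(X: Sequence[Sequence[float]]) -> List[List[float]]:
    """Sample covariance between gene columns of X (rows = samples, columns = genes)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in X]
    cov = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i, d):
            s = sum(centered[k][i] * centered[k][j] for k in range(n)) / (n - 1)
            cov[i][j] = cov[j][i] = s
    return cov


def redundancy_penalty(cov: Sequence[Sequence[float]], weights: Sequence[float]) -> float:
    """Hypothetical penalty (an assumption, not the paper's term):
    sum over gene pairs i != j of |cov[i][j]| * weights[i] * weights[j]."""
    d = len(weights)
    return sum(
        abs(cov[i][j]) * weights[i] * weights[j]
        for i in range(d)
        for j in range(d)
        if i != j
    )


# Toy data: gene 1 is exactly twice gene 0 (redundant); gene 2 varies differently.
X = [[1.0, 2.0, 0.0], [2.0, 4.0, 1.0], [3.0, 6.0, 0.0], [4.0, 8.0, 1.0]]
cov = gene_covariance(X)

# Selecting the redundant pair (genes 0 and 1) is penalized more heavily
# than selecting the less related pair (genes 0 and 2).
print(redundancy_penalty(cov, [1, 1, 0]) > redundancy_penalty(cov, [1, 0, 1]))
```

In this toy setting the penalty favors keeping only one gene from the redundant pair, which is the qualitative behavior the abstract attributes to the covariance-based regularization.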

Highlights

  • During cell division and growth, abnormal changes often occur in genes, which results in various cancers

  • To demonstrate that the gene subset selected by G3CS yields better classification results, we use three classification algorithms, Support Vector Machine (SVM), Random Forest (RF), and k-nearest neighbor (KNN), to test the gene subsets selected by the different gene selection methods

  • Six publicly available microarray datasets are used to test the performance of the proposed G3CS and of the methods F-test, RLR, WLMGS, LNNFW, GRSL-GS, and adaptive hypergraph embedded dictionary learning (AHEDL): colon cancer [71], B-cell chronic lymphocytic leukemia (CLL SUB 111), breast, lung, tumors-11, and global cancer map (GCM). CLL SUB 111 and lung can be downloaded from http://featureselection.asu.edu/datasets.php; breast and GCM can be downloaded from http://portals.broadinstitute.org/cgibin/cancer/datasets.cgi; tumors-11 can be downloaded from http://datam.i2r.a-star.edu.sg/datasets/krbd/index.html
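The evaluation idea in the highlights above, scoring a selected gene subset by how well a classifier performs on it, can be sketched as follows. This is a minimal illustration under stated assumptions: only KNN is shown (the source also uses SVM and RF), leave-one-out validation is an assumed protocol that the source does not specify, and the function names and toy data are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def knn_predict(train_X: Sequence[Sequence[float]], train_y: Sequence[int],
                query: Sequence[float], k: int = 3) -> int:
    """Predict a label by majority vote among the k nearest training samples."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(X: Sequence[Sequence[float]], y: List[int],
                 selected: Sequence[int], k: int = 3) -> float:
    """Leave-one-out KNN accuracy using only the gene columns listed in `selected`."""
    Xs = [[row[j] for j in selected] for row in X]
    correct = 0
    for i in range(len(Xs)):
        train_X = Xs[:i] + Xs[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_X, train_y, Xs[i], k) == y[i]:
            correct += 1
    return correct / len(Xs)


# Toy data: gene 0 separates the two classes, gene 1 does not.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 5.2], [1.1, 0.9], [1.2, 9.1]]
y = [0, 0, 0, 1, 1, 1]

print(loo_accuracy(X, y, selected=[0], k=1))  # subset with the informative gene
print(loo_accuracy(X, y, selected=[1], k=1))  # subset with the uninformative gene
```

Comparing the accuracy obtained on subsets chosen by different selection methods, at the same subset size, gives the kind of side-by-side comparison the highlights describe.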


Summary

Introduction

During cell division and growth, abnormal changes often occur in genes, which results in various cancers. For microarray data, classifying the different types of tumors is an important but challenging task because of the high dimensionality and small number of samples [13,14,15]: a small number of samples combined with a large number of genes can lead to the “curse of dimensionality” and to overfitting in data processing and learning models. Existing biological experiments have verified that only a very small proportion of genes contribute significantly to biological processes and disease indication. It is therefore necessary to select a subset of discriminative genes from high-dimensional microarray data to serve subsequent tasks [16,17,18,19,20,21,22,23,24,25]. Mathematical gene selection methods can be grouped into three classes: filter methods, wrapper methods, and embedded methods.

