Abstract

Feature selection and classification are the main topics in microarray data analysis. Although many feature selection methods have been proposed in this field, SVM-RFE (Support Vector Machine based on Recursive Feature Elimination) has proved to be one of the best. It ranks the features (genes) by training a support vector machine classification model and selects key genes through a recursive feature elimination strategy. The principal drawback of SVM-RFE is its high time consumption. To overcome this limitation, we introduce a more efficient implementation of linear support vector machines, improve the recursive feature elimination strategy, and combine the two to select informative genes. In addition, we propose a simple resampling method to preprocess the datasets, which balances the information distribution across sample classes and makes the classification results more credible. The applicability of four common classifiers is also studied. Extensive experiments are conducted on six of the most frequently used microarray datasets in this field, and the results show that the proposed methods greatly reduce time consumption while achieving comparable classification performance.
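
To make the ranking-and-elimination loop concrete, the following is a minimal sketch of SVM-RFE in plain Python, not the implementation used in the paper. It assumes labels in {-1, +1}, trains a linear SVM by full-batch subgradient descent on the hinge loss (a stand-in for whatever efficient linear SVM solver the paper adopts), ranks each gene j by the squared weight w_j^2, and removes a fraction of the lowest-ranked genes per round. The names `train_linear_svm`, `svm_rfe`, `n_keep` and `drop_fraction`, the hyperparameter values, and the toy data are all hypothetical; removing several genes per round is a common speed-up and is not necessarily the improved strategy the paper proposes.

```python
import random


def train_linear_svm(X, y, features, lam=0.01, epochs=200, lr=0.1):
    """Fit a linear SVM on the active features only.

    Uses full-batch subgradient descent on the L2-regularised hinge loss
    (an illustrative solver, chosen for brevity). Returns {feature: weight}.
    """
    w = {j: 0.0 for j in features}
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = {j: lam * w[j] for j in features}
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(w[j] * xi[j] for j in features) + b)
            if margin < 1:  # sample violates the margin
                for j in features:
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        for j in features:
            w[j] -= lr * grad_w[j]
        b -= lr * grad_b
    return w


def svm_rfe(X, y, n_keep, drop_fraction=0.5):
    """Recursive feature elimination driven by linear SVM weights.

    Each round retrains the SVM, ranks active features by w_j^2, and drops
    the lowest-ranked ones until n_keep features remain.
    """
    active = list(range(len(X[0])))
    eliminated = []  # earliest-removed (least informative) first
    while len(active) > n_keep:
        w = train_linear_svm(X, y, active)
        ranked = sorted(active, key=lambda j: w[j] ** 2)
        n_drop = max(1, min(int(len(active) * drop_fraction),
                            len(active) - n_keep))
        dropped = set(ranked[:n_drop])
        eliminated.extend(ranked[:n_drop])
        active = [j for j in active if j not in dropped]
    return active, eliminated


# Toy data (hypothetical): 20 samples, 6 "genes"; only genes 0 and 3
# carry the class signal, the rest are small uniform noise.
rng = random.Random(0)
y = [1, -1] * 10
X = []
for yi in y:
    row = [rng.uniform(-0.1, 0.1) for _ in range(6)]
    row[0] += 2 * yi
    row[3] += 2 * yi
    X.append(row)

selected, eliminated = svm_rfe(X, y, n_keep=2)
print("selected genes:", selected)
print("elimination order:", eliminated)
```

With `drop_fraction=1/len(active)` behaviour (one gene per round) the loop reduces to the classic, slower form of SVM-RFE; larger fractions trade ranking granularity for fewer SVM retrainings.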

Highlights

  • The invention of DNA microarray technology has produced massive amounts of gene expression microarray data, opening a new avenue for gene-related studies, mainly gene recognition and disease diagnosis [1]

  • To further speed up feature selection, we replace the standard SVM with an efficient implementation of linear SVM and combine it with the improved RFE, performing feature selection in the same way as SVM-RFE

  • In [2], C4.5, naïve Bayes and SVM are evaluated on nine microarray datasets, and the results show that SVM performs better than the other two classifiers. A similar conclusion is drawn in [1]


Summary

Introduction

The invention of DNA microarray technology has produced massive amounts of gene expression microarray data, opening a new avenue for gene-related studies, mainly gene recognition and disease diagnosis [1]. The characteristics of these data have remained almost unchanged over time. Among them, small sample size, high dimensionality and class imbalance are the most typical issues to overcome [2]. Gene recognition aims to find the genes that are strongly associated with specific diseases, so it is a feature selection task. Disease diagnosis is essentially a classification task. A training dataset with few samples and a large number of features can lead to poor generalization of the classification model [3]. Given the small sample size and high dimensionality of microarray data, it is necessary to reduce the dimensions before classification. Feature selection is currently a good choice for dimensionality reduction on microarray data.
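
Class imbalance, one of the typical issues named above, is what the resampling preprocessing mentioned in the abstract is meant to address. This summary does not specify how that resampling works, so the sketch below swaps in plain random oversampling as an illustration: minority-class samples are duplicated at random until every class is as large as the biggest one. The function name `oversample_minority`, the `seed` parameter, and the toy labels are hypothetical and should not be read as the paper's method.

```python
import random
from collections import Counter


def oversample_minority(X, y, seed=0):
    """Balance classes by random oversampling (illustrative assumption).

    Original samples are kept in order; randomly chosen duplicates of the
    smaller classes are appended until all classes match the largest one.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


# Toy example (hypothetical): 8 tumor samples versus 2 normal samples.
X_toy = [[float(i)] for i in range(10)]
y_toy = ["tumor"] * 8 + ["normal"] * 2
X_bal, y_bal = oversample_minority(X_toy, y_toy)
print(Counter(y_bal))
```

If resampling is combined with cross-validation, it is usually applied to the training folds only, so that duplicated samples do not leak into the test fold.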
