Abstract

Cancer gene expression data have a very large number of features, which makes classification challenging for existing methods. The data also contain redundant, noisy, and irrelevant genes, so feature selection and reduction play a crucial role in classifying such datasets. This work presents an ensemble of three filter methods, namely Symmetrical Uncertainty (SU), chi-square (χ²), and Relief, that reduces the feature dimension by eliminating redundant and noisy genes. It also introduces a novel heuristic, Local Search-based Feature Selection (LSFS), that further reduces the noise generated by the ensemble method. The selected features are then optimized using a genetic algorithm. Finally, the optimal feature set is classified using three models, Support Vector Machine (SVM), k-nearest neighbor (k-NN), and Random Forest (RF), to find cancer-relevant genes. Experiments are conducted on six benchmark datasets, and the results are compared with five state-of-the-art algorithms in terms of accuracy, sensitivity, specificity, F-measure, entropy, and precision. Additional experiments vary the SVM kernel used to compute the fitness value, and use multiple distance measures and several values of k for k-NN. The prediction accuracy of the proposed system on the six benchmark datasets is 99%, 90%, 98%, 94%, 98%, and 99%, respectively. The experimental results indicate that the proposed approach improves the classification of cancer gene expression data and can serve as a practical tool for gene expression analysis.
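As an illustration of the filter stage, the following minimal sketch computes Symmetrical Uncertainty for discretized features using its standard definition, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), and then combines several filter score lists by mean rank. The mean-rank rule, the function names, and the assumption that expression values have already been discretized are illustrative choices, not details given in the abstract; the paper's actual ensemble rule may differ.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)); ranges from 0 to 1."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        # Both variables are constant, so there is nothing to share.
        return 0.0
    joint = entropy(list(zip(x, y)))
    mutual_info = hx + hy - joint
    return 2.0 * mutual_info / (hx + hy)


def mean_rank_ensemble(score_lists, top_k):
    """Keep the top_k features by summed rank across filters.

    Assumption: filters are combined by mean rank (equivalently, rank sum).
    Each inner list holds one filter's score per feature, higher is better.
    """
    n_features = len(score_lists[0])
    rank_sum = [0] * n_features
    for scores in score_lists:
        order = sorted(range(n_features), key=lambda j: -scores[j])
        for rank, j in enumerate(order):
            rank_sum[j] += rank
    return sorted(range(n_features), key=lambda j: rank_sum[j])[:top_k]


# Toy usage: three hypothetical filters scoring three genes.
su_scores = [0.9, 0.1, 0.5]
chi2_scores = [0.8, 0.2, 0.6]
relief_scores = [0.7, 0.3, 0.4]
selected = mean_rank_ensemble([su_scores, chi2_scores, relief_scores], top_k=2)
```

In this sketch, `selected` holds the indices of the genes that would be passed on to later stages such as LSFS and the genetic algorithm. Rank aggregation is used here because the three filters produce scores on different scales.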

