Cancer is among the deadliest diseases in humans and is the second leading cause of death globally. A diagnosis made at an advanced stage may come too late to prevent patient mortality. It is therefore important to establish a sustainable framework that yields reliable predictions for early cancer diagnosis. In this paper, a new two-phase feature (gene) selection approach is presented. In the first phase, the kernel Shapley value (kSV), a feature extraction approach based on cooperative game theory, is used to extract the important features from high-dimensional gene expression data. In the second phase, the Harris hawks optimizer (HHO) algorithm is used to further refine the most effective features extracted by kSV. To evaluate the effectiveness of the proposed algorithm, we conduct extensive experiments on eight benchmark high-dimensional gene expression datasets and compare it with other state-of-the-art techniques. We employ three classifiers, namely support vector machines (SVM), Naive Bayes (NB), and K-nearest neighbors (KNN), to assess the efficacy of the selected genes and their impact on classification accuracy. The experimental results demonstrate that the proposed method, particularly when combined with the SVM classifier, outperforms other gene selection methods. The evaluation metrics (accuracy, precision, recall, F1-score, and ROC-AUC), together with box plots and radar plots, consistently indicate the superiority of kSV-HHO across all tested datasets. Moreover, the comparative and statistical analyses reveal that the proposed method excels in identifying the most relevant features for cancer diagnosis compared to other gene selection approaches. This makes our framework a valuable tool for cancer research and clinical practice, potentially enhancing the accuracy of early cancer diagnosis using high-dimensional biomedical gene expression data.
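
The two-phase idea can be illustrated with a minimal sketch. This is not the authors' implementation, and it rests on several assumptions: Shapley values are computed exactly by enumerating coalitions (feasible only for a handful of features, whereas kernel-based estimators approximate them for high-dimensional data); the second phase uses an exhaustive subset search as a simple stand-in for HHO; and `toy_score`, `GENE_SIGNAL`, `select_genes`, and the `penalty` parameter are hypothetical names introduced here in place of a real classifier and fitness function.

```python
from itertools import combinations
from math import factorial


def shapley_values(n_features, value):
    """Exact Shapley value of each feature for a coalition value function."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight |S|! (n - |S| - 1)! / n! shared by all coalitions of this size
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical per-gene signal; a stand-in for classifier accuracy on real data.
GENE_SIGNAL = [0.30, 0.05, 0.25, 0.00, 0.10]


def toy_score(subset):
    """Toy value function: additive signal, with genes 0 and 2 redundant."""
    score = sum(GENE_SIGNAL[g] for g in subset)
    if 0 in subset and 2 in subset:
        score -= 0.25  # assumed redundancy penalty between genes 0 and 2
    return score


def select_genes(n_features, value, shortlist_size, penalty=0.02):
    # Phase 1: rank genes by Shapley value and keep a shortlist.
    phi = shapley_values(n_features, value)
    ranked = sorted(range(n_features), key=lambda g: phi[g], reverse=True)
    shortlist = ranked[:shortlist_size]

    # Phase 2: search the shortlist for the best subset.
    # Exhaustive search is used here as a stand-in for HHO (an assumption).
    best_subset, best_fitness = frozenset(), float("-inf")
    for size in range(1, len(shortlist) + 1):
        for subset in combinations(shortlist, size):
            # Fitness rewards score and penalizes subset size
            fitness = value(frozenset(subset)) - penalty * size
            if fitness > best_fitness:
                best_subset, best_fitness = frozenset(subset), fitness
    return phi, shortlist, best_subset


phi, shortlist, best = select_genes(5, toy_score, shortlist_size=3)
print("Shapley values:", [round(p, 3) for p in phi])
print("Phase 1 shortlist:", shortlist)
print("Phase 2 selection:", sorted(best))
```

In this toy setting, the first phase shortlists the three genes with the largest Shapley values, and the second phase drops one of the two redundant genes because the size penalty outweighs its marginal gain. A metaheuristic such as HHO would take the place of the exhaustive search when the shortlist is too large to enumerate.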