Cancer classification using RNA sequencing gene expression data based on Game Shapley local search embedded binary social ski-driver optimization algorithms

Sana Afreen, Ajay Kumar Bhurjee, Rabia Musheer Aziz

doi:10.1016/j.microc.2024.111280

Abstract

Cancer remains a significant health concern due to its high mortality rates. Recent decades have witnessed substantial progress in cancer research, driven by advancements in high-throughput sequencing technology and the application of diverse machine learning (ML) methods, particularly in the analysis of gene expression data. However, the proliferation of high-dimensional datasets, such as RNA sequencing (RNA-seq) data, underscores the need for more robust ML techniques capable of efficiently handling large volumes of data to enable accurate treatment decisions. This paper introduces a novel hybrid feature selection (FS) approach that combines Game kernel SHapley Additive exPlanations (kSHAP) with the binary Social Ski Driver (bSSD), Adaptive Beta Hill Climbing (ABHC), and Late Acceptance Hill Climbing (LAHC) algorithms. The study comprehensively investigates the three resulting FS algorithms (kSHAP-bSSD, kSHAP-ABHC, and kSHAP-LAHC) for cancer classification tasks using RNA-seq datasets. Experiments were conducted on five well-established RNA-seq cancer datasets: Lung Adenocarcinoma (LUAD), Stomach Adenocarcinoma (STAD), Breast Invasive Carcinoma (BRCA), Lung Squamous Cell Carcinoma (LUSC), and Uterine Corpus Endometrial Carcinoma (UCEC). The objective is to enhance cancer classification accuracy, robustness, and scalability on RNA-seq datasets. Additionally, the study evaluates six classifiers: Autoencoder (AE), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Naive Bayes (NB), Neural Network (NN), and Random Forest (RF), with AE consistently outperforming the others. Evaluation covers accuracy, recall, precision, F1-score, confusion matrices, ROC curves, box plots, radar plots, and statistical analysis. The approach is compared against recent state-of-the-art FS algorithms, showing improvements in gene selection and classification accuracy.
kSHAP-bSSD outperforms traditional methods on all datasets, achieving an accuracy of 99.9% on LUAD and exhibiting higher accuracy and robustness on the STAD, BRCA, LUSC, and UCEC datasets. Assessment across multiple metrics affirms the superiority of the kSHAP-bSSD and kSHAP-ABHC combinations, underscoring their effectiveness in cancer classification tasks.

Full Text