Alzheimer’s disease (AD) classification, which is crucial for identifying AD-associated genes, relies heavily on effective feature selection (FS) to tackle the curse of dimensionality. Traditional filter, wrapper, and embedded techniques each have drawbacks, including ignoring feature dependencies, sensitivity to the choice of classifier, and high computational cost. Hybrid approaches that combine these methods seek to harness their collective strengths but face challenges, particularly in selecting the optimal number of features from each method. This selection is typically manual or requires time-intensive k-fold cross-validation (KFCV), significantly increasing computational demands and requiring extensive parameter optimization across method families, thereby escalating the complexity and resource requirements of model development. To overcome these challenges, this work proposes a framework for optimal FS and classification in AD that combines filter and embedded techniques with hyperparameter tuning. First, gene expression data (GED) from the AD Neuroimaging Initiative (ADNI) are preprocessed. Then, Chi-square filter selection is applied to reduce correlated features. Next, Logistic Regression with an ElasticNet penalty (LREN) further refines the feature set. Finally, Bayesian Optimization (BO) automatically determines the optimal number of features (k for Chi-square and max_features for LREN), iteratively evaluating different combinations to find the set that minimizes the number of selected features while maximizing the accuracy of a Support Vector Machine (SVM). The SVM serves as the evaluation classifier to mitigate the sensitivity of embedded selection to the model used for selection. The tuned parameters are then used to select the relevant features from the ADNI dataset, which are fitted to different models. We evaluated five classifiers, logistic regression (LR), SVM, Ridge Classifier (RC), Stochastic Gradient Descent classifier (SGD), and Gaussian Naïve Bayes (GNB), across various metrics. Among these, SVM achieved 100% performance on all metrics. The approach reduced the FS time and the number of initial features to 0.6% and 0.02%, respectively, and identified 6 of the 20 selected features as directly AD-related. Comparative analysis shows that the proposed method outperforms existing approaches on the ADNI dataset and on other datasets. Statistical tests comparing the results with those of other methods confirm a significant improvement, underscoring the effectiveness of the proposed framework for optimal FS and for confirming biological relevance.
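To make the tuning loop described above concrete, the following is a minimal sketch, not the authors' implementation, of a Chi-square plus ElasticNet-penalized logistic regression selection pipeline whose k and max_features are tuned by Bayesian optimization against cross-validated SVM accuracy. It assumes scikit-learn and scikit-optimize; the synthetic X and y, the search ranges, and all other settings are illustrative placeholders rather than the paper's actual ADNI configuration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from skopt import BayesSearchCV
from skopt.space import Integer

# Stand-in for the preprocessed gene-expression matrix and diagnosis labels
# (hypothetical data, not ADNI). chi2 requires non-negative values, hence abs().
rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(120, 2000)))
y = rng.integers(0, 2, size=120)

# Filter step (Chi-square), embedded step (LREN via SelectFromModel),
# and the SVM used to score each candidate feature subset.
pipe = Pipeline([
    ("chi2", SelectKBest(chi2)),
    ("lren", SelectFromModel(
        LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, max_iter=5000),
        threshold=-np.inf)),  # select purely by max_features
    ("svm", SVC(kernel="linear")),
])

# Bayesian optimization over k (Chi-square) and max_features (LREN),
# scored by cross-validated SVM accuracy. Ranges are illustrative; keeping
# the upper bound of max_features below the lower bound of k ensures the
# embedded step never requests more features than the filter step passes on.
search = BayesSearchCV(
    pipe,
    {"chi2__k": Integer(100, 1000),
     "lren__max_features": Integer(5, 50)},
    n_iter=30, cv=5, scoring="accuracy", random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

In this sketch the selected (k, max_features) pair would then be reused to extract the final feature subset and fit the downstream classifiers; the ranges, l1_ratio, and SVM kernel are assumptions for illustration only.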