Optimizing machine learning algorithms for diabetes data: A metaheuristic approach to balancing and tuning classifiers parameters

Hauwau Abdulrahman Aliyu,Ibrahim Olawale Muritala,Habeeb Bello-Salau,Salisu Mohammed,Adeiza James Onumanyi,Ore-Ofe Ajayi

doi:10.1016/j.fraope.2024.100153

Abstract

Diabetes mellitus poses a global health concern, prompting the development of machine learning algorithms designed to construct a model for the accurate classification of patients, enabling precise diagnoses and early-stage treatment. However, the efficacy of classifying diabetes patients through machine learning relies on datasets, often plagued by imbalance, leading to biased classification and inaccurate diagnoses. Previous research attempts, employing techniques like random sampling (under-sampling and oversampling) and the Synthetic Minority Oversampling Technique (SMOTE), have struggled to achieve optimally balanced datasets. Additionally, setting the best parameters for machine learning classifiers remains a challenging task. To address these issues, this research focuses on devising a methodological metaheuristic optimization, a machine learning algorithm tailored for diabetes data balancing, and classifier hyperparameter tuning. Leveraging Particle Swarm Optimization (PSO) algorithm for diabetes data balancing and a genetic algorithm to select the optimal architecture for various machine learning classifiers. The study compares the performance of the proposed metaheuristic data balancer and classifier architecture parameter tuner using classification metrics (F1 score, Average Precision–Recall (APR), AUC, and accuracy). The PSO balanced dataset emerges as the most effective in classifying diabetes, with an Average Percentage Improvement (API) in classification performance metrics: 20.78% accuracy, 16.79% area under the curve for receiver operating characteristics, and a significant 32.78% enhancement in APR. Moreover, the XGBOOST classifier trained with a genetic algorithm demonstrates minimal computational training time for the Centre for Disease Control and Prevention (CDC) diabetes dataset compared to the artificial neural network and random forest classifier. Notably, the imbalanced CDC diabetes dataset exhibits the least APR compared to random under-sampling and the PSO data balancing technique.

Full Text