Data balancing methods mitigate the problem of imbalanced class distributions, which often leads to the majority class being well-learned while the minority class remains underrepresented, negatively affecting classification performance. This study applies data balancing to the healthcare domain, a critical field where classification success directly impacts human life. The primary aim is to introduce novel balancing methods while addressing the previously overlooked problem of optimizing data balancing ratios. Six healthcare datasets were used: Wisconsin Diagnostic Breast Cancer (WDBC), Wisconsin Prognostic Breast Cancer (WPBC), Z-Alizadeh Sani, Kidney, Diabetes, and Stroke, all of which concern serious diseases and have imbalanced class distributions. Six balancing methods were tested: synthetic minority oversampling technique (SMOTE), adaptive synthetic sampling (ADASYN), support vector machine-SMOTE (SVM-SMOTE), Borderline-SMOTE, cubic interpolation, and quadratic interpolation, with the interpolation-based methods being adapted to this domain for the first time. The critical factor in data balancing is identifying the ratio that maximizes classification performance. In this study, particle swarm optimization (PSO), the whale optimization algorithm (WOA), and Optuna were used to optimize balancing ratios via a custom-designed fitness function that jointly accounts for classification accuracy and resource consumption. Classification was conducted for three scenarios (full balance, optimized balance, and imbalance) using support vector machine (SVM), random forest (RF), and ensemble learning (EL) classifiers, allowing for extensive analysis. Each combination of balancing method, classifier, and optimization technique was analyzed separately using accuracy, precision, recall, F1-score, time, central processing unit (CPU) usage, and memory usage. As a result, the combination that best trades off classification accuracy against resource consumption was determined for each dataset, providing both a comprehensive analysis and insights into the impact of balancing ratio optimization on diagnostic success in healthcare.
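To make the idea of a tunable balancing ratio concrete, the following is a minimal sketch rather than the study's implementation. It assumes the ratio is defined as minority count divided by majority count, replaces the neighbor search of SMOTE with linear interpolation between random pairs of minority samples, and uses a hypothetical weighted fitness function; the names `n_synthetic`, `interpolate_oversample`, and `fitness`, the weight `w_acc`, and the 10-second time normalization are all illustrative assumptions, since the abstract does not specify the actual fitness function.

```python
import random


def n_synthetic(n_major, n_minor, ratio):
    """Synthetic minority samples needed so that minority/majority reaches `ratio`.

    Assumption: ratio = 1.0 means full balance; lower values mean partial balance.
    """
    target = int(round(ratio * n_major))
    return max(0, target - n_minor)


def interpolate_oversample(minority, n_new, rng):
    """Create n_new points by linear interpolation between random minority pairs.

    Simplification: real SMOTE interpolates toward one of the k nearest
    neighbors; here the partner sample is chosen uniformly at random.
    """
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # interpolation position in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


def fitness(accuracy, seconds, cpu_frac, mem_frac, w_acc=0.7):
    """Hypothetical trade-off score (higher is better).

    Rewards accuracy and penalizes the mean of three normalized resource
    costs; the weighting scheme is an assumption, not the paper's formula.
    """
    cost = (min(seconds / 10.0, 1.0) + cpu_frac + mem_frac) / 3.0
    return w_acc * accuracy - (1.0 - w_acc) * cost


if __name__ == "__main__":
    rng = random.Random(0)
    minority = [[rng.random(), rng.random()] for _ in range(20)]
    extra = n_synthetic(n_major=100, n_minor=len(minority), ratio=0.5)
    new_points = interpolate_oversample(minority, extra, rng)
    print(len(minority) + len(new_points), "minority samples after balancing")
```

An optimizer such as PSO, WOA, or Optuna would then search over `ratio`, retraining the classifier at each candidate value and scoring it with a function like `fitness`.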