The stratified K-folds cross-validation and class-balancing methods with high-performance ensemble classifiers for breast cancer classification

Mahesh T R, Vinoth Kumar V, Dhilip Kumar V, Oana Geman, Martin Margala, Manisha Guduri

doi:10.1016/j.health.2023.100247

Abstract

Breast cancer is one of the most common causes of death among women, and early diagnosis is vital for reducing the fatality rate. This study evaluates the most widely used machine-learning methods for breast cancer prediction and diagnosis. We use the Synthetic Minority Over-sampling Technique (SMOTE) to handle class imbalance in the Wisconsin breast cancer diagnosis dataset obtained from the UCI Machine Learning Repository. We apply a variety of machine learning algorithms to the classification task: Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbours (KNN), Classification and Regression Tree (CART), and Naive Bayes (NB), together with well-known ensemble methods such as Majority-Voting, eXtreme Gradient Boosting (XGBoost), and Random Forest (RF). The findings show that the Majority-Voting ensemble, built on the top three classifiers (LR, SVM, and CART), outperforms every individual classifier and offers the highest accuracy, 99.3%.
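
The three building blocks named in the title and abstract (stratified K-fold splitting, minority over-sampling, and majority voting) can be illustrated with a minimal, standard-library Python sketch. This is not the authors' implementation: the function names `stratified_k_folds`, `smote_sample`, and `majority_vote`, the seeds, and the toy data are all assumptions made for illustration, and the over-sampler is a simplified SMOTE-style interpolation rather than a full reproduction of the published algorithm.

```python
import random
from collections import Counter, defaultdict


def stratified_k_folds(labels, k, seed=0):
    """Split sample indices into k folds that keep class proportions similar.

    Hypothetical helper: indices of each class are shuffled and dealt
    round-robin across the folds.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def smote_sample(minority_rows, n_new, k_neighbors=5, seed=0):
    """Create n_new synthetic minority rows (simplified SMOTE-style sketch).

    Each synthetic row lies on the segment between a randomly chosen
    minority row and one of its k nearest minority neighbours.
    Assumes at least two minority rows with numeric features.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        # Rank the other minority rows by squared Euclidean distance.
        neighbours = sorted(
            (row for row in minority_rows if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )[:k_neighbors]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


if __name__ == "__main__":
    # Toy labels: 6 benign ("B") and 3 malignant ("M") samples.
    labels = ["B"] * 6 + ["M"] * 3
    for fold in stratified_k_folds(labels, k=3):
        print(sorted(fold), Counter(labels[i] for i in fold))

    # Three hypothetical classifiers voting on two samples.
    print(majority_vote([["M", "B"], ["M", "M"], ["B", "B"]]))
```

In a sketch like this, one common precaution is to call the over-sampler on the training folds only, so that synthetic rows never leak into the fold held out for evaluation. With an odd number of voters and two classes, as in the three-classifier ensemble described above, a majority vote cannot tie.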

Full Text