Aim: This study investigates and applies machine learning techniques for the early detection and accurate diagnosis of breast cancer. The analysis uses four breast cancer datasets: Breast Cancer Wisconsin, Breast Cancer Diagnosis, NKI Breast Cancer, and SEER Breast Cancer. The primary focus is on identifying key features and on using preprocessing methods to improve classification accuracy. Methods: The datasets undergo several preprocessing steps: label encoding of categorical variables, linear-regression imputation of missing values, and robust-scaler normalization of feature values. To address class imbalance, SMOTE combined with Tomek links is used to rebalance the classes. Significant features are selected through L2 (Ridge) regularization, which helps pinpoint the most important predictors of breast cancer. A range of machine learning models is applied to the classification tasks: decision tree, random forest, support vector machine (SVM), neural network, k-nearest neighbors, naïve Bayes, extreme gradient boosting (XGBoost), and AdaBoost. Model performance is assessed using accuracy, precision, recall, F1-score, and the Kappa statistic, and is further evaluated using the receiver operating characteristic (ROC) curve and the precision-recall curve. Results: The XGBoost model achieved the best performance on both the Breast Cancer Wisconsin and Breast Cancer Diagnosis datasets. The SVM model reached 100% accuracy on the NKI Breast Cancer dataset, while the random forest model performed best on the SEER Breast Cancer dataset. Feature selection through L2 (Ridge) regularization was crucial in enhancing the performance of these models. Conclusions: This work emphasizes the critical role of machine learning in improving breast cancer detection. By combining preprocessing techniques with classification models, the study identifies significant features and improves model performance. These findings contribute to the development of more accurate diagnostic tools, which may ultimately improve patient outcomes.
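The evaluation metrics named in the Methods (accuracy, precision, recall, F1-score, and the Kappa statistic) can all be derived from a binary confusion matrix. The following is a minimal sketch, assuming binary labels and taking the Kappa statistic to mean Cohen's kappa; the function name `binary_metrics`, the `BinaryMetrics` container, and the toy labels are illustrative assumptions and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float
    kappa: float


def binary_metrics(y_true, y_pred, positive=1):
    """Compute five common classification metrics from paired binary labels.

    Hypothetical helper for illustration; not the study's implementation.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa: observed agreement corrected for the agreement
    # expected by chance, estimated from the row and column marginals.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return BinaryMetrics(accuracy, precision, recall, f1, kappa)


# Toy labels (1 = malignant, 0 = benign), invented purely for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
m = binary_metrics(y_true, y_pred)
print(m)
```

Kappa is worth reporting alongside accuracy on imbalanced datasets because it discounts the agreement a classifier would reach by chance from the class proportions alone.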