Abstract

Breast cancer remains a leading cause of cancer-related deaths among women. Mortality is mainly attributed to metastasis and recurrence, so early detection of breast cancer recurrence is a real-world medical problem. Using data mining approaches, we compared four popular machine learning models (Logistic Regression, Naïve Bayes, K-Nearest Neighbors, and Support Vector Machines) for classifying breast cancer recurrence on a high-dimensional but very small dataset, the Wisconsin Prognostic Breast Cancer Data Set. Each model was evaluated under four configurations: a) scaling only, b) scaling with PCA, c) scaling with PCA and oversampling of the minority class, and d) selected features only (i.e., keeping only one feature from each set of features that are highly correlated with each other). Our results showed that Logistic Regression provided the best scores in almost all metrics (precision, recall, accuracy, weighted F1 score, AUROC, AUPROC, and Cohen's kappa statistic) in all four configurations, followed by Support Vector Machines and then by K-Nearest Neighbors. Naïve Bayes performed poorly, especially in the scaling with PCA configuration; however, its performance improved when we retained only one feature from each set of highly correlated features. KNN improved its recall with oversampling, while SVM's accuracy remained fairly constant across all four configurations. Based on this study, the Logistic Regression model can serve as a potential model for predicting breast cancer recurrence, enabling clinicians to propose treatment options based on whether a patient's features correspond to a good prognosis or a bad one (recurrence). This indicates the clinical utility of data mining methods for the early detection of breast cancer recurrence in post-surgical patients, which may help save lives.
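
Configuration (d), keeping one feature from each set of highly correlated features, can be read as a greedy correlation filter. The sketch below is one possible reading and not the authors' implementation: the function names, the 0.9 threshold, and the toy values are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat it as uncorrelated
    return cov / sqrt(var_x * var_y)


def select_uncorrelated(columns, threshold=0.9):
    """Greedily keep a feature only if |r| with every kept feature is below threshold.

    `columns` maps a feature name to its list of values.
    The default threshold of 0.9 is an assumption, not a value from the abstract.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy values made up for illustration; they are not taken from the data set.
toy = {
    "radius": [1.0, 2.0, 3.0, 4.0, 5.0],
    "perimeter": [6.3, 12.5, 18.9, 25.1, 31.4],  # nearly proportional to radius
    "texture": [2.0, 1.0, 4.0, 3.0, 5.0],
}
selected = select_uncorrelated(toy)
print(selected)
```

In this toy case "perimeter" is dropped because it is almost perfectly correlated with "radius", while "texture" is kept. A greedy filter of this kind depends on column order, so a different ordering can retain a different representative from each correlated group.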
