Breast cancer (BC) is now identified as a disease with a significant impact on morbidity and mortality that is growing and widespread worldwide. This study uses a publicly available clinical dataset of 699 patients from the University of Wisconsin with 9 variables: (1) clump thickness, (2) uniformity of cell size, (3) uniformity of cell shape, (4) marginal adhesion, (5) single epithelial cell size, (6) bare nuclei, (7) bland chromatin, (8) normal nucleoli, and (9) mitoses. This dataset has been used for many studies in the past to pinpoint critical factors in patient diagnosis. Here, we use this data to ensure its unbiasedness and accuracy. We then apply principal component analysis and machine learning models to identify factors in diagnosing a malignant or benign tumor. We investigate and compare the classification accuracy of different machine learning models, including tree, linear discriminant, quadratic discriminant, logistic regression, naive Bayes, support vector machine (SVM), K-nearest neighbor (KNN), ensemble, neural network, and kernel. The best models that can achieve the highest accuracy are medium Gaussian SVM, coarse Gaussian SVM, and cosine KNN, with an accuracy of 96.5%. The principal component analysis method is then performed to identify crucial components and build an accurate model with fewer parameters. The medium Gaussian SVM has the highest cross-validation classification accuracy of 96.98% and requires only three predictors: normal nucleoli, bare nuclei, and cell size uniformity.
Read full abstract