Abstract

Breast cancer is one of the deadliest diseases, claiming approximately 627,000 lives worldwide in 2018–2019. Automated prediction that enables early detection could therefore help clinicians treat the disease at an early stage and substantially reduce the risk of death. In the present study, the Breast Cancer Wisconsin (Diagnostic) Data Set was taken from the University of California Irvine (UCI) Machine Learning Repository. The dataset (n=699) contained 30 predictor variables and one dependent variable, the type of cancer tissue (benign or malignant). To predict the tissue type present in the patient, models were built using 1) Logistic Regression (LR), 2) Decision Tree Classifier (DTC), 3) Random Forest Classifier (RFC), 4) K-Nearest Neighbors (KNN), 5) Support Vector Machine (SVM), and 6) AdaBoost Classifier (ABC). To improve model accuracy, a correlation matrix was used to select the top 8 features. To improve accuracy further, the Synthetic Minority Oversampling Technique (SMOTE) was applied to address class imbalance, and accuracy was compared before and after SMOTE. Precision, recall, and F1 score were the performance metrics used to select the best model. The results show that, after SMOTE was applied to each of the six algorithms, KNN gave the highest accuracy, 95.321%. SMOTE improved the accuracy of some algorithms but degraded the performance of others. This research could be developed into practical tools for the medical field that predict the stage of disease more accurately and so support better treatment management.
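
The three building blocks named in the abstract (correlation-based feature selection, SMOTE, and precision/recall/F1) can be illustrated with a minimal, standard-library-only sketch. This is not the study's code: the function names, the toy data, and the choice to rank features by absolute Pearson correlation with the label are all assumptions made for illustration, since the abstract does not state how the correlation matrix was used to pick the top 8 features.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def top_k_features(X, y, k):
    """Indices of the k features with the largest |correlation| with the label.
    Assumption: the study's exact selection rule is not given in the abstract."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 score for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, made up for illustration: 3 features, label 1 = malignant.
X = [[1.0, 5.0, 0.3], [2.0, 3.0, 0.1], [3.0, 4.0, 0.2],
     [4.0, 6.0, 0.4], [8.0, 5.0, 0.3], [9.0, 4.0, 0.2]]
y = [0, 0, 0, 0, 1, 1]

selected = top_k_features(X, y, k=1)
minority = [row for row, label in zip(X, y) if label == 1]
majority_count = sum(label == 0 for label in y)
new_points = smote(minority, n_new=majority_count - len(minority))
metrics = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
print(selected, len(new_points), metrics)
```

In practice, oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the evaluation data; whether the study did so is not stated in the abstract.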
