Abstract

Breast cancer is the most prevalent type of cancer in women. Risk factor assessment can help direct counseling on risk reduction and breast cancer surveillance. This research aims to (1) investigate the relationship between various risk factors and breast cancer incidence using the BCSC (Breast Cancer Surveillance Consortium) Risk Factor Dataset and create a prediction model for assessing the risk of developing breast cancer; (2) diagnose breast cancer using the Breast Cancer Wisconsin diagnostic dataset; and (3) analyze breast cancer survivability using the SEER (Surveillance, Epidemiology, and End Results) Breast Cancer Dataset. Because applying resampling techniques to the training dataset before fitting machine learning models can affect classifier performance, the three breast cancer datasets were examined using a variety of pre-processing approaches and classification models, and performance was assessed in terms of accuracy, precision, F1 score, and related metrics. PCA (principal component analysis) and the resampling strategies had a notable effect on the results. For the BCSC Dataset, the Random Forest algorithm performed best among the applied classifiers, with an accuracy of 87.53%. Among the resampling techniques applied to the training dataset before training the Random Forest classifier, Tomek Links gave the best test accuracy, at 87.47%. In general, applying the resampling techniques decreased test accuracy even when training accuracy increased. All models were compared with previously used techniques. For the Breast Cancer Wisconsin diagnostic dataset, the K-Nearest Neighbor algorithm had the best accuracy on the test set of the original dataset, at 94.71%, and accuracy on the test set of the PCA-transformed dataset reached 95.29% for detecting breast cancer.
Using the SEER Dataset, this study also explores survival analysis, employing supervised and unsupervised learning approaches to offer insights into the variables affecting breast cancer survivability. This study emphasizes the significance of individualized approaches in the management and treatment of breast cancer by incorporating phenotypic variations and recognizing the heterogeneity of the disease. Through data-driven insights and advanced machine learning, this study contributes significantly to the ongoing efforts in breast cancer research, diagnostics, and personalized medicine.
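
As an illustration of the Tomek Link resampling mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes the usual definition: a Tomek link is a pair of samples from different classes that are each other's nearest neighbors, and undersampling removes the majority-class member of each link from the training data only. The function names and the toy data are hypothetical and are not taken from the BCSC dataset.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_neighbor(i, X):
    """Index of the sample closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: dist(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def remove_tomek_majority(X, y, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = {k for pair in tomek_links(X, y) for k in pair
            if y[k] == majority_label}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]


# Hypothetical toy training set: label 0 is the majority class.
X_train = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0), (5.0, 5.0)]
y_train = [0, 0, 0, 1, 0, 1]

# Resample the training split only; the test split is left untouched.
X_res, y_res = remove_tomek_majority(X_train, y_train, majority_label=0)
```

The brute-force neighbor search is quadratic in the number of samples, so it suits only small examples; a library implementation would be the practical choice for a dataset of BCSC size.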
