Abstract

Background: Breast Cancer (BC) is a global health crisis. The World Health Organization reported 2.09 million new cases of BC and 627,000 BC-related deaths worldwide in 2018. The traditional BC screening method in developed countries is mammography, whilst developing countries rely on breast self-examination and clinical breast examination. The gold standard for BC detection is triple assessment: i) clinical examination; ii) mammography and/or ultrasonography; and iii) Fine Needle Aspirate Cytology. However, cheaper, efficient and noninvasive methods of BC screening and detection would be beneficial.

Design and methods: We propose using eight machine learning algorithms: i) Logistic Regression; ii) Support Vector Machine; iii) K-Nearest Neighbors; iv) Decision Tree; v) Random Forest; vi) Adaptive Boosting; vii) Gradient Boosting; and viii) eXtreme Gradient Boosting, together with blood test results from the BC Coimbra Dataset (BCCD), available from the University of California Irvine online database, to create models for BC prediction. To ensure the models' robustness, we will employ: i) stratified k-fold cross-validation; ii) Correlation-based Feature Selection (CFS); and iii) parameter tuning. The models will be validated on validation and test sets of BCCD using both the full feature set and the reduced feature set, since feature reduction affects algorithm performance. Seven metrics, including accuracy, will be used for model evaluation.

Expected impact of the study for public health: CFS, together with the highest-performing model(s), can help identify specific blood tests that point towards BC and that may serve as important BC biomarkers. The highest-performing model(s) may eventually be used to create an Artificial Intelligence tool to assist clinicians in BC screening and detection.

Significance for public health: This study could potentially identify important BC biomarkers based on patients' routine anthropometric and blood data. This will be attempted using the correlation-based feature selection algorithm, the highest-performing machine learning model(s) from this study, and the publicly available BC Coimbra Dataset from the University of California Irvine database. The biomarkers may give clinicians a direction to explore in future BC clinical trials. Such trials would serve to validate the biomarkers from this study, which could then be introduced in clinical settings globally as an easy, cost-effective first step for BC screening and detection. An Artificial Intelligence tool could eventually be created using the highest-performing model(s). Clinicians would input patient-specific biomarkers into the tool, and the tool would output the likelihood of the patient having BC, with a certain level of accuracy. This envisioned process could eventually revolutionize the early prediction of BC in patients and, consequently, reduce the BC mortality rate.
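The Correlation-based Feature Selection step named in the design can be illustrated with a minimal sketch. The abstract does not describe how CFS will be implemented, so everything here is an assumption: the sketch uses the standard CFS merit heuristic, k * r_cf / sqrt(k + k(k-1) * r_ff), with absolute Pearson correlation as the association measure and a simple greedy forward search. The feature names (`marker_a`, `marker_b`, `noise`) and all values are synthetic placeholders, not BCCD data.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)


def cfs_merit(features, labels, subset):
    """CFS merit of a feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff)."""
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean absolute feature-class correlation.
    r_cf = sum(abs(pearson(features[f], labels)) for f in subset) / k
    # Mean absolute feature-feature correlation (0 for a single feature).
    pairs = list(combinations(subset, 2))
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def cfs_forward_select(features, labels):
    """Greedy forward search: add the best feature until merit stops improving."""
    selected, best = [], 0.0
    remaining = list(features)
    while remaining:
        scored = [(cfs_merit(features, labels, selected + [f]), f) for f in remaining]
        merit, name = max(scored)
        if merit <= best:
            break
        selected.append(name)
        remaining.remove(name)
        best = merit
    return selected, best


# Synthetic toy data (hypothetical, NOT from BCCD): 4 controls (0), 4 patients (1).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "marker_a": [1.0, 1.2, 0.9, 1.1, 2.0, 2.2, 1.9, 2.1],  # strong class signal
    "marker_b": [1.0, 2.0, 1.0, 2.0, 2.0, 3.0, 2.0, 3.0],  # weaker, overlaps marker_a
    "noise": [5, 3, 4, 6, 4, 6, 5, 3],                     # no class signal
}

selected, merit = cfs_forward_select(features, labels)
print(selected)  # with this toy data: ['marker_a']
```

In this toy case the search keeps only `marker_a`: `marker_b` is correlated with the class but largely redundant with `marker_a`, so adding it lowers the merit. This is the behaviour that lets CFS point to a small set of informative blood tests. Published CFS implementations often use other association measures and search strategies, so this should be read as an illustration of the idea rather than the study's procedure.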
