Abstract

Objective: Early disease screening and diagnosis are important for improving patient survival, so identifying early predictive features of disease is necessary. This paper presents a comprehensive comparative analysis of different Machine Learning (ML) models and reports the standard deviation of the results obtained through sampling with replacement. The research has two aims: (a) to analyze and compare ML strategies used to predict Breast Cancer (BC) and Cardiovascular Disease (CVD), and (b) to use feature importance ranking to identify early high-risk features.

Results: The Bayesian hyperparameter optimization method was more stable than the grid search and random search methods. In a BC diagnosis dataset, the Extreme Gradient Boosting (XGBoost) model had an accuracy of 94.74% and a sensitivity of 93.69%. The mean value of the cell nucleus in the Fine Needle Aspiration (FNA) digital image of a breast lump was identified as the most important predictive feature for BC. In a CVD dataset, the XGBoost model had an accuracy of 73.50% and a sensitivity of 69.54%. Systolic blood pressure was identified as the most important feature for CVD prediction.
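
The reported standard deviations come from sampling with replacement on held-out data. As a rough illustration of that idea (not the authors' exact evaluation code), the sketch below bootstraps a test set to estimate the mean and standard deviation of accuracy and sensitivity; the fitted classifier clf and the NumPy arrays X_test, y_test are assumed, illustrative inputs.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

def bootstrap_metrics(clf, X_test, y_test, n_rounds=1000, seed=0):
    """Resample the test set with replacement and report mean/std of accuracy and sensitivity."""
    rng = np.random.default_rng(seed)
    accs, sens = [], []
    n = len(y_test)
    for _ in range(n_rounds):
        idx = rng.integers(0, n, size=n)              # bootstrap indices (with replacement)
        y_pred = clf.predict(X_test[idx])
        accs.append(accuracy_score(y_test[idx], y_pred))
        sens.append(recall_score(y_test[idx], y_pred))  # sensitivity = recall of the positive (1) class
    return (np.mean(accs), np.std(accs)), (np.mean(sens), np.std(sens))
```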

Highlights

  • Modern medical methods prevent disease through early intervention rather than treatment after diagnosis

  • In a Breast Cancer (BC) diagnosis dataset, the Extreme Gradient Boosting (XGBoost) model had an accuracy of 94.74% and a sensitivity of 93.69%

  • Unlike traditional grid search and random search methods, Bayesian parameter optimization algorithms based on Gaussian processes can find stable hyperparameters, and they are widely used in machine learning [24] (see the sketch below)
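
The paper's own tuning code is not reproduced here; the following is a minimal sketch of a Gaussian-process-based Bayesian hyperparameter search for an XGBoost classifier using scikit-optimize's BayesSearchCV. The search space, iteration budget, and scoring metric are illustrative assumptions, not the authors' settings.

```python
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from xgboost import XGBClassifier

# Illustrative search space; the paper's actual ranges are not given here.
search_space = {
    "max_depth": Integer(2, 8),
    "learning_rate": Real(1e-3, 0.3, prior="log-uniform"),
    "n_estimators": Integer(100, 1000),
    "subsample": Real(0.5, 1.0),
}

opt = BayesSearchCV(
    estimator=XGBClassifier(eval_metric="logloss"),
    search_spaces=search_space,
    n_iter=50,                                   # number of Bayesian optimization steps
    cv=5,                                        # 5-fold cross-validation per candidate
    scoring="accuracy",
    optimizer_kwargs={"base_estimator": "GP"},   # Gaussian-process surrogate model
    random_state=42,
)
# opt.fit(X_train, y_train)                      # X_train, y_train: your training data
# print(opt.best_params_, opt.best_score_)
```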


Summary

Results

Feature selection
The purpose of feature selection is to reduce dimensionality, which may improve the generalization of the algorithm [20,21,22]. We selected features by analyzing the correlations among features in the BC diagnosis dataset. Additional file 1: Table S1 illustrates the correlations among the features, and the additional doc file contains more information (see Additional file 1). After feature selection, six features were retained in the BC diagnosis dataset. Additional file 1: Fig. S1 illustrates the feature selection process for the CVD dataset; the additional doc file contains more information (see Additional file 1). In the small BC diagnosis dataset, XGBoost performed better than LightGBM, GBDT, LR, RF, BPNN and DT but was not as stable as GBDT. In the large CVD dataset, XGBoost's classification performance was relatively stable (Table 1 and Fig. 2a–d). In the CVD dataset, the patient's systolic blood pressure was the most important feature for prediction (Fig. 2f).
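
The paragraph above combines two steps, correlation-based feature selection and importance ranking. A rough sketch of both is given below, assuming a pandas DataFrame X of features and a binary label vector y; the 0.9 correlation threshold and the use of XGBoost's built-in importance scores are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

def drop_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from each pair whose absolute Pearson correlation exceeds the threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

def rank_features(X: pd.DataFrame, y) -> pd.Series:
    """Fit an XGBoost classifier and return features sorted by importance score."""
    model = XGBClassifier(eval_metric="logloss").fit(X, y)
    return pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

# Usage (X: feature DataFrame, y: 0/1 labels):
# X_reduced = drop_correlated(X, threshold=0.9)
# print(rank_features(X_reduced, y))
```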
