Background: Coronary heart disease (CHD) remains a prominent cause of mortality globally, so early and accurate detection is essential. Traditional diagnostic approaches can be invasive, costly, and time-consuming, creating a need for more efficient alternatives. This study aimed to optimize the Light Gradient-Boosting Machine (LightGBM) algorithm to enhance its performance and accuracy in the early detection of CHD, providing a reliable, cost-effective, and non-invasive diagnostic tool.

Methods: The Framingham Heart Study (FHS) dataset, publicly available on Kaggle, was used in this study. Multiple Imputation by Chained Equations (MICE) was applied separately to the training and testing sets to handle missing data. Borderline-SMOTE (Synthetic Minority Over-sampling Technique) was applied to the training set only, to balance the classes. The LightGBM algorithm was selected for its efficiency in classification tasks, and Bayesian Optimization with the Tree-structured Parzen Estimator (TPE) was employed to fine-tune its hyperparameters. The optimized LightGBM model was trained and then evaluated on the test set using metrics such as accuracy, precision, and AUC-ROC, with cross-validation to assess robustness and generalizability.

Findings: The optimized LightGBM model showed substantial improvement over the baseline in early CHD detection. The baseline LightGBM model, with rows containing missing values dropped, had an accuracy of 0.8333, sensitivity of 0.1081, precision of 0.3429, F1 score of 0.1644, and AUC of 0.6875. With MICE imputation, performance improved to an accuracy of 0.9399, sensitivity of 0.6693, precision of 0.9043, F1 score of 0.7692, and AUC of 0.9457. The combined approach of Borderline-SMOTE, MICE imputation, and TPE-tuned LightGBM achieved an accuracy of 0.9882, sensitivity of 0.9370, precision of 0.9835, F1 score of 0.9597, and AUC of 0.9963, indicating a highly effective and robust model.

Interpretation: The optimized model demonstrated outstanding performance in early CHD detection. The study's strengths include its comprehensive approach to addressing missing data and class imbalance and the fine-tuning of hyperparameters through Bayesian Optimization. However, the model still needs to be tested on other datasets before its generalizability can be established. This study provides a strong framework for early CHD detection that could improve clinical practice by allowing for more precise and dependable diagnostics and more effective interventions.
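
As a quick consistency check on the reported figures, F1 is the harmonic mean of precision and sensitivity, so it can be recomputed from the values given in the Findings. The sketch below does this in plain Python. The function name `f1_from` and the dictionary labels are illustrative and are not taken from the study. Small differences in the fourth decimal place are expected because the inputs are already rounded.

```python
def f1_from(precision: float, sensitivity: float) -> float:
    """Return F1, the harmonic mean of precision and sensitivity (recall)."""
    if precision + sensitivity == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * sensitivity / (precision + sensitivity)


# (sensitivity, precision, reported F1) exactly as stated in the abstract.
# The labels are illustrative shorthand for the three configurations.
reported = {
    "baseline (missing rows dropped)": (0.1081, 0.3429, 0.1644),
    "MICE imputation": (0.6693, 0.9043, 0.7692),
    "Borderline-SMOTE + MICE + TPE": (0.9370, 0.9835, 0.9597),
}

for name, (sens, prec, f1_reported) in reported.items():
    f1 = f1_from(prec, sens)
    print(f"{name}: recomputed F1 = {f1:.4f}, reported F1 = {f1_reported:.4f}")
```

The recomputed values agree with the reported F1 scores to within rounding. They also show that the baseline's low F1 is driven mainly by its low sensitivity, not by its accuracy.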