Abstract

Heart disease is among the most prevalent medical conditions globally, and early diagnosis is vital to reducing the number of deaths. Machine learning (ML) has been used to identify people at risk of heart disease. Feature selection and data resampling are crucial for obtaining a reduced feature set and balanced data, both of which improve classifier performance, and estimating the optimum feature subset is a fundamental issue in most ML applications. This study first employs the hybrid Synthetic Minority Oversampling Technique-Edited Nearest Neighbor (SMOTE-ENN) method to balance the heart disease dataset. It then selects the most relevant features for the prediction of heart disease. Feature selection is performed with recursive feature elimination (RFE), using multiple base algorithms as the estimator at its core. The relevant features selected by the various RFE implementations are then combined using set theory to obtain the optimum feature subset. The reduced feature set is used to build six ML models using logistic regression, decision tree, random forest, linear discriminant analysis, naïve Bayes, and extreme gradient boosting (XGBoost) algorithms. We conduct experiments using both the complete and the reduced feature sets. The results show that data resampling and feature selection lead to improved classifier performance. The XGBoost classifier achieved the best performance, with an accuracy of 95.6%. Compared to some recently developed heart disease prediction methods, our approach obtains superior performance.

Keywords: Feature selection, Heart disease, Machine learning, SMOTE-ENN, XGBoost
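
The step of combining the feature subsets from several RFE runs can be illustrated with a minimal sketch. The abstract does not state which set operation the study uses, so the example below shows three common choices (union, intersection, and majority vote) as assumptions. The feature names, the `rfe_selections` dictionary, and the `combine_subsets` helper are hypothetical placeholders, not the subsets or code reported in the study.

```python
from collections import Counter

# Hypothetical RFE outputs, one subset per base estimator.
# The feature names are illustrative placeholders only.
rfe_selections = {
    "logistic_regression": {"age", "cp", "thalach", "oldpeak", "ca"},
    "decision_tree": {"cp", "thalach", "oldpeak", "ca", "thal"},
    "random_forest": {"age", "cp", "oldpeak", "ca", "thal"},
}


def combine_subsets(selections, min_votes=None):
    """Combine per-estimator feature subsets with basic set operations.

    Assumption: the combination rule is not given in the abstract, so
    all three candidates are returned for comparison.
    """
    subsets = list(selections.values())
    if not subsets:
        raise ValueError("at least one feature subset is required")

    # Features chosen by any estimator.
    union = set().union(*subsets)
    # Features chosen by every estimator.
    intersection = set.intersection(*subsets)

    # Features chosen by at least `min_votes` estimators
    # (defaults to a strict majority).
    if min_votes is None:
        min_votes = len(subsets) // 2 + 1
    votes = Counter(feature for subset in subsets for feature in subset)
    majority = {feature for feature, n in votes.items() if n >= min_votes}

    return {"union": union, "intersection": intersection, "majority": majority}


combined = combine_subsets(rfe_selections)
for rule, features in combined.items():
    print(rule, sorted(features))
```

The intersection yields the smallest and most conservative subset, the union the largest, and the majority vote sits between the two. Which rule gives the best classifier is an empirical question that would have to be checked on the resampled data.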
