Abstract
Machine Learning (ML) in the healthcare industry has recently made headlines. Several ML models have been developed in state-of-the-art frameworks using different databases. However, their performance still needs to improve before they can predict heart disease robustly and accurately. The main aim of this work is to propose a new ML pipeline for accurate prediction of heart disease. The pipeline combines pre-processing with an entropy-based feature engineering (FE) approach that produces high-quality features and thereby improves model performance. The heart disease dataset (HDD) is curated by combining the Cleveland, VA Medical Center, Hungarian and Switzerland databases over 14 common attributes. In the curated HDD, missing values are imputed (IMV) based on the relations that exist between healthcare attributes, and outliers are removed (OR) based on the Mahalanobis distance. Experimental results show that IMV + OR pre-processing performs better than the other pre-processing methods evaluated. Different ML models were analysed on the HDD after IMV + OR pre-processing, combined with Independent Component Analysis (ICA), Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and the proposed entropy-based FE. Using the proposed entropy-based FE with IMV + OR pre-processing improved all metrics for the Naive Bayes (NB) and Logistic Regression (LR) classifiers. Further, the ensemble model (LR + NB) performed well under the proposed pipeline, with AUC (96.8%), Accuracy (92.7%), Specificity (91.5%), Precision (92.5%) and F1 Score (0.931), outperforming state-of-the-art results.
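To illustrate the outlier-removal (OR) step, the following is a minimal sketch of Mahalanobis-distance filtering. It is not the authors' implementation. The function names, the synthetic data, and the cutoff are assumptions made for illustration. The abstract does not state which threshold was used, so the sketch borrows the common convention of a chi-square quantile (about 16.27 for 3 attributes at the 0.999 level).

```python
import numpy as np


def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the sample mean."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # pinv tolerates a singular covariance matrix (e.g. collinear attributes).
    cov_inv = np.linalg.pinv(cov)
    diff = X - mu
    # Row-wise quadratic form diff_i^T * cov_inv * diff_i.
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)


def inlier_mask(X, cutoff):
    """Boolean mask of rows to keep; `cutoff` is an assumed threshold."""
    return mahalanobis_sq(X) <= cutoff


# Synthetic stand-in for numeric attributes (not the real HDD).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[0] = [25.0, -25.0, 25.0]  # planted outlier

keep = inlier_mask(X, cutoff=16.27)  # assumed chi-square cutoff, 3 dof
X_clean = X[keep]
```

On real data the cutoff would need to match the number of numeric attributes, and because the mean and covariance are themselves affected by outliers, a robust covariance estimate may be preferable.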