ABSTRACT
Machine learning (ML) is important in the treatment of heart disease because it can analyze large amounts of patient data, such as medical records, imaging tests, and genetic information, to identify patterns and predict the risk of developing heart disease. However, most ML algorithms require complete, accurate data to build an accurate prediction model and do not tolerate missing values. Handling missing risk factors is therefore a critical part of dataset preprocessing, and it becomes more difficult when a risk factor is completely missing. Removing a completely missing feature may discard critical information, yet no readily available imputation method handles this case, which presents a significant challenge. To overcome this difficulty, in this study we attempt to impute the feature using statistical multiple linear regression and Huber regression (HR) on four blended datasets (Statlog, Cleveland, Hungarian, and Switzerland) sourced from the UCI ML repository. Each dataset comprises 14 attributes, including one target variable; however, in the Switzerland dataset, one feature (“serum cholesterol”) is entirely missing. In the proposed imputation methods, the missing “serum cholesterol” is estimated from predictors including “chest pain,” “maximum heart rate,” “type of defect,” “exercise-induced ST depression relative to rest,” and “exercise-induced angina.” As part of the risk factor identification strategy, we also propose applying the majority voting ensemble technique with ML algorithms to the individual and combined datasets. The results show that our proposed stacked algorithm, applied to the combined dataset with the ensemble features, reached an accuracy of 93.47% and an AUC score of 94.50%, giving more accurate and earlier prediction than previous research while also demonstrating the model's diversity, resilience, generalization, and adaptability to varied datasets.
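The imputation idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the five predictor columns only stand in for the clinical predictors named in the abstract, and all function names (`fit_ols`, `fit_huber`, `predict`) are hypothetical. The Huber-type fit uses iteratively reweighted least squares with a conventional threshold of 1.35, which is an assumption of this sketch; a library estimator such as scikit-learn's `HuberRegressor` could be used instead. Both regressions are trained on "donor" records where cholesterol is observed and then used to fill a "target" dataset where the feature is entirely missing.

```python
import numpy as np

def fit_ols(X, y):
    """Multiple linear regression with an intercept (ordinary least squares)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def fit_huber(X, y, delta=1.35, n_iter=50):
    """Huber-type robust regression via iteratively reweighted least squares.

    delta is the Huber threshold in units of a robust residual scale
    (an assumed default for this sketch, not a value from the paper).
    """
    A = np.column_stack([np.ones(len(X)), X])
    beta = fit_ols(X, y)  # start from the OLS solution
    for _ in range(n_iter):
        resid = y - A @ beta
        # Robust scale estimate from the median absolute deviation.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        scale = max(scale, 1e-8)
        u = np.abs(resid) / scale
        # Huber weights: 1 for small residuals, delta/u for large ones.
        w = np.where(u <= delta, 1.0, delta / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta

def predict(beta, X):
    """Apply fitted coefficients (intercept first) to a predictor matrix."""
    return beta[0] + X @ beta[1:]

# Synthetic stand-in data (assumption): 5 predictors, cholesterol as response.
rng = np.random.default_rng(0)
n_donor, n_target, n_pred = 300, 60, 5
true_beta = np.array([200.0, 5.0, -3.0, 2.0, 4.0, -1.0])

# Donor records: cholesterol observed, with a few gross outliers added.
X_donor = rng.normal(size=(n_donor, n_pred))
chol_donor = true_beta[0] + X_donor @ true_beta[1:] + rng.normal(scale=2.0, size=n_donor)
chol_donor[:15] += 150.0

# Target records: cholesterol entirely missing, predictors available.
X_target = rng.normal(size=(n_target, n_pred))

beta_ols = fit_ols(X_donor, chol_donor)
beta_huber = fit_huber(X_donor, chol_donor)

imputed_ols = predict(beta_ols, X_target)
imputed_huber = predict(beta_huber, X_target)
```

In this sketch the outliers pull the ordinary least-squares intercept upward, while the Huber-type fit down-weights them, which is the usual motivation for pairing a robust regression with a standard one when imputing from heterogeneous sources.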