This paper presents a comprehensive exploration of machine learning algorithms (MLAs) and feature selection techniques for accurate heart disease prediction (HDP). Working with several datasets that each pose different challenges, the study identifies effective strategies for early detection. The MLAs studied include Decision Trees (DT), Random Forests (RF), Support Vector Machines (SVM), and Gaussian Naive Bayes (NB), among others, and are evaluated with an emphasis on precision and recall. Real-world data issues are addressed through data cleaning and one-hot encoding, which improves the integrity of the predictive models. Recursive Feature Elimination (RFE), Principal Component Analysis (PCA), and univariate feature selection are used to identify relevant features and reduce data dimensionality, and the findings show the impact of these techniques on prediction accuracy. The model for each dataset was optimized through grid search hyperparameter tuning, and the resulting configurations are reported in detail. Notably, 99.12% accuracy was achieved on the first Kaggle dataset. Model robustness across the datasets is examined, with caution against overfitting: the study emphasizes the need for validation on unseen data and encourages further research on generalizability. Intended as a practical guide, this work aids researchers and practitioners in developing HDP models that can inform clinical decisions and healthcare resource allocation. By providing insight into effective algorithms and techniques, the paper supports the healthcare community's ongoing efforts to reduce heart disease-related morbidity and mortality.
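Because the abstract emphasizes precision and recall alongside accuracy, a minimal sketch of how the three metrics are computed for a binary HDP classifier may be useful. The function name `precision_recall` and the label vectors below are hypothetical illustrations, not the paper's code or data; labels are assumed to be 1 for disease present and 0 for absent.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (precision, recall) for the positive class (label 1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Return 0.0 when a denominator is empty rather than dividing by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels for ten patients (illustrative only, not from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints: accuracy=0.80 precision=0.75 recall=0.75
```

In this toy case accuracy (0.80) is higher than recall (0.75): one of the four patients with disease is missed, which is the kind of error that accuracy alone can hide in a screening setting.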