Abstract

Early detection and control of diabetes can help prevent associated long-term health risks to the heart, lungs, kidneys, and nervous system, among others. In this work, we develop a boosting ensemble machine learning (ML) model to predict diabetes based on the Pima Indian Diabetes Dataset (PIDD). The dataset is preprocessed to enhance the learning ability of the model, using standardization, outlier removal, data balancing, and dimension reduction. The performance of several ML algorithms, namely Logistic Regression (LR), the Random Forest (RF) classifier, AdaBoost, and Extreme Gradient Boosting (XGBoost), is compared to select the best model. The performance metrics used for comparison are Accuracy, Recall, Precision, F1-Score, and Area Under the ROC Curve (AUC-ROC). Because the application is medical diagnosis, the cost of a false negative is of utmost importance, so the Recall value played a significant role in selecting the best model. Among the basic ML models (LR and RF), RF with a power transformer achieved the highest prediction accuracy and recall, 0.968 and 0.924 respectively. The boosting ensemble models predicted diabetes with better performance metrics, which were further improved by hyper-parameter tuning. The AdaBoost-based model achieved an accuracy of 0.966 and a recall of 0.97. According to this work, the best model for predicting diabetes on PIDD is the hyper-parameter-tuned XGBoost model, with accuracy and recall both equal to 1.
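To make the role of Recall concrete, the following minimal sketch computes the threshold-based metrics named in the abstract from a set of binary predictions. It is an illustration only, not the authors' code: the function name `classification_metrics` and the ten labels are hypothetical, and the label convention (1 = diabetic, 0 = non-diabetic) is an assumption.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, recall, precision and F1 for binary labels.

    Assumed convention (not taken from the paper): 1 = diabetic, 0 = non-diabetic.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives (missed cases)

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Hypothetical labels: two diabetic patients, one of whom the model misses.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case the accuracy is 0.9 while the recall is only 0.5, because half of the diabetic patients go undetected. This is the situation that motivates weighting Recall heavily when false negatives are costly.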
