Improved logistic regression model for diabetes prediction by integrating PCA and K-means techniques

Changsheng Zhu, Christian Uwa Idemudia, Wenfang Feng

doi:10.1016/j.imu.2019.100179


Open Access



Abstract

Diabetes causes a large number of deaths each year, and many people living with the disease do not realize their health condition early enough. In this study, we propose a data mining based model for early diagnosis and prediction of diabetes using the Pima Indians Diabetes dataset. Although k-means is simple and can be used for a wide variety of data types, it is quite sensitive to the initial positions of the cluster centers, which determine the final clustering result. That result either provides a sufficient and efficiently clustered dataset for the logistic regression model, or yields less data because of incorrect clustering of the original dataset, thereby limiting the performance of the logistic regression model. Our main goal was to determine ways of improving the accuracy of k-means clustering and logistic regression. Our model comprises PCA (principal component analysis), k-means, and a logistic regression algorithm. Experimental results show that PCA improved the accuracy of both the k-means clustering algorithm and the logistic regression classifier compared with the results of other published studies: k-means correctly classified 25 more instances, and logistic regression accuracy was 1.98% higher. As such, the model is shown to be useful for automatically predicting diabetes using patient electronic health records data. A further experiment with a new dataset showed the applicability of our model for the prediction of diabetes.
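The abstract's point about k-means being sensitive to the initial cluster centers can be illustrated with a small sketch. The four toy points, the function names, and the two starting configurations below are assumptions chosen for illustration; they are not the Pima Indians Diabetes data and not the authors' implementation. The sketch runs a plain Lloyd-style k-means twice on the same points and compares the within-cluster sum of squared errors (SSE).

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centers, max_iter=100):
    """Plain Lloyd-style k-means started from explicitly given centers.

    Returns (labels, centers, sse), where sse is the within-cluster
    sum of squared distances to the final centers.
    """
    centers = [tuple(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(x) / len(members) for x in zip(*members))
    sse = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, sse


# Hypothetical toy data: corners of a rectangle that is wide and short,
# so the natural split is left pair versus right pair.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start A: one initial center on each side of the rectangle.
labels_a, centers_a, sse_a = kmeans(points, [(0.0, 0.0), (4.0, 0.0)])

# Start B: both initial centers on the left side.
labels_b, centers_b, sse_b = kmeans(points, [(0.0, 0.0), (0.0, 1.0)])

print("start A:", labels_a, centers_a, sse_a)
print("start B:", labels_b, centers_b, sse_b)
```

In this toy case, start A recovers the left/right grouping, while start B converges to a top/bottom grouping with a much larger SSE, even though the data are identical. This is the kind of initialization dependence the abstract refers to.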

Full Text