Diabetes disease detection and classification on Indian demographic and health survey data using machine learning methods

Puneeth N Thotad, Geeta R Bharamagoudar, Basavaraj S Anami

doi:10.1016/j.dsx.2022.102690

Abstract

Background & aim: Diabetes mellitus has become one of the outbreaks causing major health problems in developing countries such as India, and there is a growing need to leverage technology in diabetes management. The main objective of this work is to deploy machine learning methods for clinically relevant detection and classification of diabetes.

Methods: The Indian Demographic and Health Survey 2016 dataset is considered, and risk factors are determined for both continuous and categorical data. Kernel entropy component analysis is used for dimensionality reduction of the feature set. Predictive exploration-based machine learning methods, namely logistic regression, Gaussian naive Bayes, linear discriminant analysis, support vector classifier, k-nearest neighbor, decision tree, extreme gradient boosting, kernel entropy component analysis, and random forest, are deployed. The methodology has three phases: feature extraction, classification, and prediction.

Results: Random forest gave the maximum classification accuracy: 99.84% on the imbalanced dataset and 96.75% on the kernel entropy component analysis-induced balanced dataset (balanced using the synthetic minority oversampling technique). The maximum precision of 99.64% was obtained with a support vector classifier on the balanced dataset. The kernel entropy component analysis-induced random forest reached an area under the curve of 99% on the balanced dataset. All other models performed moderately when applied to the kernel entropy component analysis-trained dataset.

Conclusions: The random forest model performed better than the other models. The overall performance of the machine learning models can be improved by training them on the diabetes dataset transformed using kernel entropy component analysis.
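
The abstract refers to balancing the dataset with the synthetic minority oversampling technique (SMOTE). As a rough illustration of the core idea only, and not the authors' implementation, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The function name `smote`, its parameters, and the toy data are all assumptions made for this example; they do not come from the paper or its dataset.

```python
import random
from math import dist


def smote(minority, n_new, k=3, seed=0):
    """Simplified SMOTE sketch: return n_new synthetic minority points.

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: 20 majority points, 4 minority points.
majority = [(float(i), float(i % 5)) for i in range(20)]
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]

# Oversample the minority class until both classes have equal size.
new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

In practice, oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the data used for evaluation.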

Full Text