Abstract
Early diagnosis of diabetes helps avoid the major risks associated with the disorder. The proposed research designs a machine learning pipeline that generates the most representative feature subset of minimal size for predicting the onset of diabetes with the highest accuracy. It employs a novel diabetes dataset which, unlike the well-known PID dataset, is gender-neutral and more representative. Each machine learning pipeline combines a feature engineering pipeline, which generates a reduced feature subset, with multiple heterogeneous classifiers that are trained on that subset. The feature engineering involves feature selection as well as feature extraction. The former uses the ANOVA filter and the Crow Search Optimization algorithm. The latter employs Singular Value Decomposition. Classification is performed on the preprocessed dataset using a wide range of heterogeneous classifiers, namely Naive Bayes, Logistic Regression, K-Nearest Neighbor, Decision Trees, Support Vector Machine, Random Forest, AdaBoost, and GradientBoost, as base learners, followed by their stacking ensemble. The performance of each machine learning pipeline is evaluated through Repeated Stratified K-fold Cross Validation using the metrics of accuracy, precision, recall, F1 score, and area under the Receiver Operating Characteristic curve. The number of features in the preprocessed dataset varies from pipeline to pipeline, and the highest accuracy of 98.4% is achieved with the Crow Search algorithm through a stacking ensemble of multiple heterogeneous classifiers. A comparative analysis with a recent related work on the same dataset shows that the proposed feature engineering pipelines, used with the same set of classifiers, achieve higher accuracy with a smaller feature set.
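One of the pipelines described in the abstract (ANOVA filter, stacking ensemble, Repeated Stratified K-fold Cross Validation) can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' implementation: the data is a synthetic stand-in generated with `make_classification`, and the number of retained features (`k=8`), the three base learners, the logistic-regression meta-learner, and the fold and repeat counts are all assumptions chosen for brevity. The Crow Search Optimization and Singular Value Decomposition variants are not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 samples, 16 features, binary label.
X, y = make_classification(
    n_samples=300, n_features=16, n_informative=6, random_state=0
)

# A small subset of heterogeneous base learners (assumption: the paper uses more).
base_learners = [
    ("nb", GaussianNB()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Stacking ensemble: base-learner predictions feed a meta-learner
# (logistic regression here, which is an assumption).
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

# Feature engineering (ANOVA F-test filter) followed by classification.
# Placing the filter inside the pipeline refits it on each training fold,
# so the held-out fold never influences feature selection.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("anova", SelectKBest(score_func=f_classif, k=8)),
    ("stack", stack),
])

# Repeated Stratified K-fold Cross Validation with the five reported metrics.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
scores = cross_validate(pipeline, X, y, cv=cv, scoring=scoring)

# Average each metric over all folds and repeats.
mean_scores = {metric: scores[f"test_{metric}"].mean() for metric in scoring}
for metric, value in mean_scores.items():
    print(f"{metric}: {value:.3f}")
```

Under the same assumptions, the feature-extraction variant could be approximated by swapping the `SelectKBest` step for `sklearn.decomposition.TruncatedSVD`. Scores obtained on the synthetic data say nothing about the results reported for the actual dataset.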
Highlights
A very common chronic disorder prevalent in the modern world is Diabetes Mellitus
The results of various experiments employing the different classifiers with the proposed feature engineering pipelines (FEP) are described in multiple subsections
In the current research, a novel diabetes dataset from the UCI repository is employed rather than the benchmark Pima Indian Diabetes (PID) dataset
Summary
A very common chronic disorder prevalent in the modern world is Diabetes Mellitus. It has become a serious health issue throughout the world irrespective of geographic boundaries. The disorder is associated with the insulin hormone produced by the pancreas. It occurs in one of the following forms: Type 1, Type 2 and Gestational diabetes [1]. Type 1 diabetes is caused when the body’s immune system causes the destruction of beta cells of the pancreas. The body has deficient insulin which makes the glucose absorption in