Abstract

Objective: Diabetes mellitus affects millions of people, including children and pregnant women. Undiagnosed diabetes can affect the entire body, leading to complications such as heart attacks, chronic kidney disease, foot ulcers, and damage to the eyes. An intelligent model should therefore be developed for early detection of diabetes.

Method: Data preprocessing is an important step in building classification models. The Pima Indian Diabetes dataset from the University of California Irvine (UCI) repository is a challenging dataset with a large proportion (48%) of missing values. Several data preprocessing steps are performed on the Pima Diabetes dataset to improve the accuracy of the classification model. The proposed model includes outlier removal and imputation at stage 1, normalization at stage 2, and balancing of the dataset at stage 3. After each stage of preprocessing, the model is evaluated using three classifiers: Support Vector Machine (SVM), Random Forest (RF), and K-nearest neighbor (KNN).

Findings: Classification accuracy increases after each stage of preprocessing. After all three stages are completed, the Random Forest classifier achieves the highest accuracy (82.14%) and balanced accuracy (81.94%) on the diabetes dataset, compared with SVM and KNN.

Novelty/Improvements: The preprocessing steps (replacing outliers using the 5th and 95th percentile values with median imputation, followed by Z-score normalization and balancing the dataset using SMOTE) improve the quality of the Pima Diabetes dataset and thereby increase the classification accuracy of the model. The same preprocessing methods can also be applied to other datasets or other classifier models.

Keywords: Balanced Dataset, Imputation, Normalization, Outlier Removal, Random Forest

Highlights

  • An enormous amount of data is available in the area of medical science

Introduction

The data obtained may not be in a proper format for analysis, so raw data need to be preprocessed carefully for proper diagnosis of disease[1]. Data preprocessing is an important step in data mining and involves data transformation, imputation, outlier removal, normalization, feature selection, and dimensionality reduction[2]. Not every preprocessing step is necessary; the required steps can be included in the model according to the nature of the available data. An outlier is a data point that lies far outside the rest of the data or population. Outliers can adversely affect the results of statistical analysis.
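As a rough illustration of the outlier handling, imputation, and normalization steps named in the abstract, the following is a minimal sketch for a single numeric feature. It rests on several assumptions that the source does not confirm: missing values are represented as `None`, outliers are capped at the 5th and 95th percentile values (one possible reading of "replacing the outliers using 5 and 95 percentile values"), percentiles use linear interpolation over the observed values, and imputation happens in the same pass as capping. The function names `percentile`, `cap_and_impute`, and `zscore` are hypothetical.

```python
import statistics


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("need at least one observed value")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def cap_and_impute(values, low_q=5, high_q=95):
    """Stage 1 (assumed reading): cap outliers at the percentile bounds
    and fill missing entries (None) with the median of observed values."""
    observed = sorted(v for v in values if v is not None)
    median = statistics.median(observed)
    low = percentile(observed, low_q)
    high = percentile(observed, high_q)
    return [median if v is None else min(max(v, low), high) for v in values]


def zscore(values):
    """Stage 2: Z-score normalization using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Toy feature column with one missing value and one extreme value.
raw = [1, 2, 3, 4, 5, None, 100]
cleaned = cap_and_impute(raw)
normalized = zscore(cleaned)
```

After both steps the feature has zero mean and unit variance, and the extreme value no longer dominates the scale. Balancing the classes (stage 3) is not shown here; in practice it would typically be applied only to the training split.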
