Abstract

The medical industry holds large amounts of sensitive data that must be analyzed to gain insight from patient records. However, the nonlinearity, non-normality, correlation structure and complexity of diabetic medical records make accurate prediction difficult. The Pima Indian Diabetes dataset is one such case, owing to its class imbalance, large number of missing values and the difficulty of identifying high-risk factors. Computational approaches such as machine learning have addressed some of these challenges, but their performance has fallen short, and pre-processing is recognized as critical to obtaining reliable results. The goal of this work is to apply multiple pre-processing techniques to increase the accuracy of several simple models. First, median imputation replaces null values with the median of each input variable, computed separately for diabetic and non-diabetic patients. Next, oversampling and under-sampling are applied to the minority and majority classes to address the class imbalance noted in the literature. Finally, Pearson correlation is used as a dimension-reduction step to detect high-risk features, since it is effective at quantifying the relationship between attributes and their labels. These techniques are applied in the same order to Linear Regression, Naive Bayes, Decision Tree, K Nearest Neighbor, Random Forest and Gaussian Boosting classifiers. Their utility is validated using Accuracy, Precision and Recall. The Random Forest classifier emerges as the most improved model, with 95 percent accuracy, 94.25 percent precision and 95.35 percent recall. Medical practitioners may find these strategies beneficial in improving the efficiency of diabetes analysis.
Keywords— Classifiers, diabetes, Pima Indian Diabetes dataset, pre-processing techniques
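
The three pre-processing steps named in the abstract can be illustrated with a minimal, standard-library sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the column names (`Glucose`, `BMI`, `Outcome`), the toy values, the per-class target size, and the use of plain random resampling, since the abstract does not say which samplers or thresholds the authors used.

```python
import random
from statistics import mean, median


def impute_class_median(rows, features, label="Outcome"):
    """Fill None values with the median of the same feature within the same class."""
    observed = {}
    for r in rows:
        for f in features:
            if r[f] is not None:
                observed.setdefault((r[label], f), []).append(r[f])
    # Assumes every (class, feature) pair has at least one observed value.
    medians = {key: median(vals) for key, vals in observed.items()}
    return [
        {**r, **{f: medians[(r[label], f)] for f in features if r[f] is None}}
        for r in rows
    ]


def resample_to(rows, target, label="Outcome", seed=0):
    """Random over-/under-sampling so that every class ends up with `target` rows."""
    rng = random.Random(seed)
    groups = {}
    for r in rows:
        groups.setdefault(r[label], []).append(r)
    balanced = []
    for group in groups.values():
        if len(group) > target:
            # Majority class: under-sample without replacement.
            balanced.extend(rng.sample(group, target))
        else:
            # Minority class: keep all rows, then over-sample with replacement.
            balanced.extend(group)
            balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def rank_features(rows, features, label="Outcome"):
    """Rank features by absolute correlation with the label, strongest first."""
    ys = [r[label] for r in rows]
    scores = {f: abs(pearson([r[f] for r in rows], ys)) for f in features}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up toy records (not values from the Pima dataset).
features = ["Glucose", "BMI"]
rows = [
    {"Glucose": 150, "BMI": 34.0, "Outcome": 1},
    {"Glucose": None, "BMI": 30.0, "Outcome": 1},
    {"Glucose": 90, "BMI": 25.0, "Outcome": 0},
    {"Glucose": 100, "BMI": None, "Outcome": 0},
    {"Glucose": None, "BMI": 27.0, "Outcome": 0},
    {"Glucose": 110, "BMI": 29.0, "Outcome": 0},
]

# Same order as in the abstract: impute, resample, then rank features.
imputed = impute_class_median(rows, features)
balanced = resample_to(imputed, target=3)
ranked = rank_features(balanced, features)
```

In a real pipeline the ranked scores would be compared against a chosen cut-off to decide which features to keep before training the classifiers; the cut-off is not stated in the abstract, so none is assumed here.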
