Abstract

When machine learning is used to design a prediction model in medical science, high accuracy is essential. High accuracy becomes difficult to achieve when values are unavailable in certain fields of the data set, so the issue of missing values must be dealt with effectively. This research work focuses on an efficient way to handle missing values. The authors propose a systematic methodology for the identification of missing values and test it on the Cleveland Heart Disease dataset from the UCI (University of California, Irvine) repository. Missing values are introduced using three different approaches, namely random, MISSHASH, and MISSFIB. Four imputation methods, k-nearest neighbor (KNN), multivariate imputation by chained equations (MICE), mean, and mode imputation, were analyzed with the help of four classifiers: Naive Bayes (NB), support vector machine (SVM), logistic regression (LR), and random forest (RF). The root mean square error (RMSE) of the classifiers was compared to find the best imputation method. MICE was found to perform better than the other imputation methods. Moreover, its accuracy is independent of the classifier and the missing value distribution.

Keywords: KNN imputation; Mean imputation; Mode imputation; Multivariate imputation by chained equations; Root mean square error
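As a rough illustration of the evaluation idea described above, the following minimal sketch removes values at random from a single numeric column, fills them with mean and mode imputation, and reports an RMSE. It rests on several assumptions that are not taken from the paper: the column is synthetic rather than drawn from the Cleveland dataset, the helper names `mask_at_random`, `impute`, and `rmse` are hypothetical, and RMSE is computed between the original and imputed values, which may differ from how the authors computed the RMSE of their classifiers. KNN and MICE are left out because they need a multivariate setup, and MISSHASH and MISSFIB are left out because the abstract does not define them.

```python
import math
import random
from statistics import mean, mode


def mask_at_random(values, fraction, rng):
    """Return a copy of `values` with a `fraction` of entries replaced by None."""
    n_missing = round(len(values) * fraction)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def impute(values, strategy):
    """Fill None entries with the mean or the mode of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]


def rmse(true_values, imputed_values):
    """Root mean square error between two equally long sequences."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared) / len(squared))


rng = random.Random(0)
# Synthetic stand-in for one numeric column (assumption, not real patient data).
column = [rng.randint(30, 75) for _ in range(200)]
masked = mask_at_random(column, 0.10, rng)

for strategy in ("mean", "mode"):
    filled = impute(masked, strategy)
    print(strategy, round(rmse(column, filled), 3))
```

For the multivariate methods, scikit-learn provides `KNNImputer` and the experimental, MICE-inspired `IterativeImputer`, which could be compared in the same way under these assumptions.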
