Abstract

This study investigates which method best imputes missing values in a remote healthcare data set. A missing value is an empty field in a health record. It may occur for three major reasons: (i) the parameter was not measured, (ii) it was measured but not recorded, or (iii) it was lost during communication. Our case study, the Portable Health Clinic (PHC) data set, was collected from multiple regions by different authorities at different times. PHC data also contains manual errors. Missing and erroneous data are problematic for data analysis and for making accurate predictions. Hence, it is necessary to detect and eliminate erroneous data and to fill the empty fields. Missing value imputation methods are well established for numerical data. PHC data contains both numerical and categorical data, which makes it difficult to impute. We propose a new data processing mechanism whose output is fed into an existing machine learning algorithm. To test our idea, we used a complete PHC data set (numerical only) without any missing values and generated missing values by randomly erasing part of it. We then applied several existing imputation methods and our proposed method to the same target data set and compared their performance. We found that the Mean Imputer, kNN and MissForest were not effective. Iterative Imputer predicted best for 7 features and our method for 4. We therefore conclude that the effectiveness of imputation methods may vary depending on the specific data set and features. In future work, we will include the categorical data and monitor the performance.
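
The evaluation protocol described above (start from complete numerical data, erase cells at random, impute, then score each method per feature) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the data is synthetic, the feature names, the 10% erasure rate and the RMSE metric are hypothetical choices, and the median imputer is only a stand-in second method so that a comparison can be shown. The proposed method, kNN, MissForest and Iterative Imputer are not reproduced here.

```python
import math
import random
import statistics


def make_complete_data(n_rows, seed=0):
    """Synthetic stand-in for a complete numerical data set (hypothetical features)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        age = rng.uniform(20, 80)
        systolic = 90 + 0.6 * age + rng.gauss(0, 5)  # loosely tied to age
        glucose = rng.gauss(100, 15)                 # independent of the others
        rows.append([age, systolic, glucose])
    return rows


def erase_randomly(rows, fraction, seed=1):
    """Copy the data, set `fraction` of the cells to None, and return the erased positions."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    erased = rng.sample(cells, int(fraction * len(cells)))
    masked = [row[:] for row in rows]
    for i, j in erased:
        masked[i][j] = None
    return masked, erased


def column_impute(masked, summary):
    """Fill each missing cell with `summary` of the observed values in its column."""
    n_cols = len(masked[0])
    fill = [summary([row[j] for row in masked if row[j] is not None])
            for j in range(n_cols)]
    return [[fill[j] if v is None else v for j, v in enumerate(row)]
            for row in masked]


def rmse_per_feature(truth, imputed, erased, n_cols):
    """RMSE computed over the erased cells only, separately for each column."""
    squared = [[] for _ in range(n_cols)]
    for i, j in erased:
        squared[j].append((truth[i][j] - imputed[i][j]) ** 2)
    return [math.sqrt(sum(s) / len(s)) if s else float("nan") for s in squared]


complete = make_complete_data(200)
masked, erased = erase_randomly(complete, fraction=0.1)

# Each imputer sees the same masked data, so the scores are directly comparable.
imputers = {"mean": statistics.mean, "median": statistics.median}
results = {}
for name, summary in imputers.items():
    filled = column_impute(masked, summary)
    results[name] = rmse_per_feature(complete, filled, erased, n_cols=3)

for name, scores in results.items():
    print(name, [round(s, 2) for s in scores])
```

Scoring per feature rather than over the whole table matches the way the results are reported, where different methods perform best on different features. Any other imputer that takes the masked table and returns a filled one could be scored with the same `rmse_per_feature` function.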
