Abstract

Real medical datasets usually consist of missing data with different patterns which decrease the performance of classifiers used in intelligent healthcare and disease diagnosis systems. Many methods have been proposed to impute missing data, however, they do not fulfill the need for data quality especially in real datasets with different missing data patterns. In this paper, a four-layer model is introduced, and then a hybrid imputation (HIMP) method using this model is proposed to impute multi-pattern missing data including non-random, random, and completely random patterns. In HIMP, first, non-random missing data patterns are imputed, and then the obtained dataset is decomposed into two datasets containing random and completely random missing data patterns. Then, concerning the missing data patterns in each dataset, different single or multiple imputation methods are used. Finally, the best-imputed datasets gained from random and completely random patterns are merged to form the final dataset. The experimental evaluation was conducted by a real dataset named IRDia including all three missing data patterns. The proposed method and comparative methods were compared using different classifiers in terms of accuracy, precision, recall, and F1-score. The classifiers’ performances show that the HIMP can impute multi-pattern missing values more effectively than other comparative methods.

Highlights

  • Along with the reduced physical activity and the spread of sedentary life as well as consumption of unhealthy foods, the diabetes affliction age has been reduced, and its incidence rate has been increased [1,2,3]

  • In real-world medical datasets, the missing data usually occur with different patterns

  • Failure in identifying the type of missing data pattern and applying imputation methods regardless of the missingness type can reduce the performance of classifiers

Read more

Summary

Introduction

Along with the reduced physical activity and the spread of sedentary life as well as consumption of unhealthy foods, the diabetes affliction age has been reduced, and its incidence rate has been increased [1,2,3]. Diabetes is a leading cause of mortality and an expensive medical problem [5,6]. The real-world datasets may consist of all patterns MCAR, MAR, and MAR which are categorized in multi-pattern missing values. Missing completely at random (MCAR) pattern: The MCAR pattern occurs completely randomly throughout the dataset. In this type of missingness, a random subset of the missing observations has distributions similar to the observed values [37,56]. If feature Y has missing values, the MCAR pattern will occur when the missing values on feature Y are independent of all the observed features and values of Y.

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call