Growing awareness of how incomplete data affect learning and classification accuracy has led to a growing number of missing data techniques. This article investigates the robustness and accuracy of seven popular techniques for tolerating incomplete training and test data, examining how different proportions, patterns, and mechanisms of missing data affect the resulting tree-based models. The seven missing data techniques were compared by artificially simulating different proportions, patterns, and mechanisms of missing data using 21 complete datasets (i.e., with no missing values) obtained from the University of California, Irvine repository of machine-learning databases (Blake and Merz, 1998). A four-way repeated measures design was employed to analyze the data. The simulation results suggest important differences among the techniques. All methods have their strengths and weaknesses. However, listwise deletion is substantially inferior to the other six techniques, while multiple imputation, which utilizes the expectation-maximization algorithm, represents a superior approach to handling incomplete data. Decision tree single imputation and surrogate variable splitting are more severely affected by missing values distributed among all attributes than by missing values confined to a single attribute. Otherwise, both the imputation-based and the model-based procedures gave reasonably good results, although some discrepancies remained. Different techniques for addressing missing values when using decision trees can give substantially different results, and must be carefully considered to protect against biases and spurious findings. Multiple imputation should always be used, especially if the data contain many missing values. If few values are missing, any of the missing data techniques might be considered. The choice of technique should be guided by the proportion, pattern, and mechanism of missing data, especially the latter two. However, the use of older techniques like listwise deletion and mean or mode single imputation is no longer justifiable given the accessibility and ease of use of more advanced techniques, such as multiple imputation and supervised learning imputation.
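As a concrete illustration of the kind of comparison described above (a minimal sketch, not the authors' simulation code), the following Python example injects an assumed 20% of values missing completely at random into the Iris data, used here as a stand-in for one of the 21 UCI datasets, and compares mean single imputation with scikit-learn's IterativeImputer, which stands in for EM-based multiple imputation, when fitting a decision tree classifier.

```python
# Sketch: simulate MCAR missingness on a complete dataset and compare two
# missing data techniques for a decision tree (illustrative only; the dataset,
# missingness proportion, and imputer choices are assumptions, not the paper's).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

# Inject a chosen proportion of values missing completely at random (MCAR).
proportion_missing = 0.2  # assumed proportion, for illustration only
mask = rng.random(X.shape) < proportion_missing
X_missing = X.copy()
X_missing[mask] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_missing, y, test_size=0.3, random_state=0
)

for name, imputer in [
    ("mean single imputation", SimpleImputer(strategy="mean")),
    ("iterative imputation", IterativeImputer(random_state=0)),
]:
    # Impute training and test data, then evaluate a decision tree.
    X_train_imp = imputer.fit_transform(X_train)
    X_test_imp = imputer.transform(X_test)
    tree = DecisionTreeClassifier(random_state=0).fit(X_train_imp, y_train)
    acc = accuracy_score(y_test, tree.predict(X_test_imp))
    print(f"{name}: accuracy = {acc:.3f}")
```

In a fuller experiment of the kind the abstract describes, this loop would be repeated over multiple datasets, missingness proportions, patterns (single versus all attributes), and mechanisms (MCAR, MAR, MNAR), with repeated random draws to support the repeated measures analysis.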