Abstract

Missing data is a common problem for researchers in machine learning applications. Missing values affect both the performance of analysis tools and the quality of the decisions drawn from them. This research analyzes the impact of four missing data treatment methods on the predictive accuracy of the C4.5 decision tree algorithm. It also investigates the imputation accuracy of each imputation method using a single dataset with missing values present in a single variable. The work was performed under three missing data assumptions, namely Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR), with three missingness rates: 5%, 10%, and 15%. The methods used to treat the missing data are list-wise deletion, mean/mode imputation, k-nearest neighbor imputation, and decision tree imputation. The experimental results showed that the C4.5 classifier achieved better performance under the MCAR assumption, while mean/mode imputation yielded the highest C4.5 predictive accuracy under the MAR and MNAR assumptions. The k-nearest neighbor method obtained the most accurate imputation results under the MCAR assumption, whereas mean/mode imputation was the most accurate method under the MAR assumption. On the other hand, the lowest imputation accuracy was observed under the MNAR assumption and was attributed to the mean/mode imputation method.
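To make the experimental design concrete, the following is a minimal sketch of how missing values might be injected into a single variable under the three assumptions and then treated by list-wise deletion, mean imputation, and k-nearest neighbor imputation. It is not the study's code or dataset: the synthetic data, all function names, the choice of a numeric variable (so the mean is used rather than the mode), and the simplified MAR and MNAR mechanisms (rows drawn only from the upper half of an observed or of the target variable) are assumptions made for illustration. Decision tree imputation and the C4.5 classifier are omitted for brevity.

```python
import math
import random


def make_data(n=200, seed=0):
    """Synthetic rows (x, y); y is the single variable that will lose values."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        rows.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))
    return rows


def missing_mask(rows, rate, mechanism, seed=0):
    """Return the set of row indices whose y value is hidden."""
    rng = random.Random(seed)
    n = len(rows)
    k = round(rate * n)
    if mechanism == "MCAR":
        # Missingness is unrelated to any variable.
        return set(rng.sample(range(n), k))
    if mechanism == "MAR":
        # Missingness depends on the observed variable x (assumed rule).
        ranked = sorted(range(n), key=lambda i: rows[i][0], reverse=True)
    elif mechanism == "MNAR":
        # Missingness depends on the hidden value y itself (assumed rule).
        ranked = sorted(range(n), key=lambda i: rows[i][1], reverse=True)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return set(rng.sample(ranked[: n // 2], k))


def listwise_delete(rows, missing):
    """Drop every row that has a missing value."""
    return [row for i, row in enumerate(rows) if i not in missing]


def mean_impute(rows, missing):
    """Fill each missing y with the mean of the observed y values."""
    observed = [y for i, (_, y) in enumerate(rows) if i not in missing]
    mean = sum(observed) / len(observed)
    return {i: mean for i in missing}


def knn_impute(rows, missing, k=5):
    """Fill each missing y with the mean y of the k observed rows nearest in x."""
    observed = [row for i, row in enumerate(rows) if i not in missing]
    filled = {}
    for i in missing:
        x = rows[i][0]
        nearest = sorted(observed, key=lambda row: abs(row[0] - x))[:k]
        filled[i] = sum(y for _, y in nearest) / len(nearest)
    return filled


def rmse(rows, filled):
    """Root mean squared error between the true and imputed y values."""
    return math.sqrt(
        sum((rows[i][1] - v) ** 2 for i, v in filled.items()) / len(filled)
    )


if __name__ == "__main__":
    data = make_data()
    for mechanism in ("MCAR", "MAR", "MNAR"):
        for rate in (0.05, 0.10, 0.15):
            mask = missing_mask(data, rate, mechanism)
            print(mechanism, rate,
                  len(listwise_delete(data, mask)),
                  round(rmse(data, mean_impute(data, mask)), 3),
                  round(rmse(data, knn_impute(data, mask)), 3))
```

Running the script prints, for each mechanism and rate, the number of rows kept by list-wise deletion and the imputation error of the two imputation methods. Because the synthetic variable here is strongly tied to `x`, the numbers say nothing about the rankings reported in this study; the sketch only illustrates the procedure.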

