Missing data is an important problem in the analysis and classification of high dimensional data. The aim of this study is to compare the effects of four different missing data imputation methods on classification performance in high dimensional data. In this study, missing data imputation methods were evaluated using data sets, whose independent variables between mixed correlated with each other, for binary dependent variable, p=500 independent variables, n=150 units and 1000 times running simulation. Missing data structures were created according to different missing rates. Different datasets were obtained by imputing the missing values using different methods. Regularized regression methods such as lasso and elastic net regression were used for imputation, as well as tree-based methods such as support vector machine and classification and regression trees. At the end of simulation, the classification scores of the methods were obtained by gradient boosting machines and the missing data prediction performances were evaluated according to the distance of these scores from the reference. Our simulation demonstrates that regularized regression methods outperform tree-based methods in classifying high dimensional datasets. Additionally, it was found that the increase in the amount of missing values reduced the classification performance of the methods in high dimensional data.
Read full abstract