Nearest neighbor imputation for categorical data by weighting of attributes

Shahla Faisal, Gerhard Tutz

doi:10.1016/j.ins.2022.01.056

Abstract

Missing values are a common phenomenon in modern medical research on complex diseases. The data often contain nominal or categorical variables, for example, single nucleotide polymorphisms (SNPs) in genetic studies. If the missing values are not handled properly, the downstream statistical analysis of the incomplete data may be biased. While various imputation methods are available for metrically scaled variables, methods for categorical data are scarce. Nearest neighbor imputation has been shown to work well for high-dimensional metrically scaled variables. In this paper, we propose a weighted nearest neighbors approach to impute missing values in categorical variables in high-dimensional datasets. The proposed method explicitly uses information on the association among attributes. Its performance is compared with that of available imputation methods under different simulation settings. Several real data sets, including heart, DNA, and lymphatic cancer data, are also used to support the simulation results. The results show that weighting the attributes yields smaller imputation errors than existing approaches such as random forest and MICE.
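The abstract does not spell out the weighting scheme, so the following is only a minimal illustrative sketch of the general idea, not the authors' algorithm. It assumes integer-coded categories with `-1` marking a missing value, Cramér's V as the association measure that supplies the attribute weights, a weighted mismatch distance between rows, and a majority vote among the `k` nearest donors. The names `MISSING`, `cramers_v`, and `impute_weighted_knn` are hypothetical.

```python
import numpy as np
from collections import Counter

MISSING = -1  # assumed code for a missing category (not from the paper)


def cramers_v(a, b):
    """Cramér's V between two fully observed integer-coded categorical vectors."""
    cats_a, ia = np.unique(a, return_inverse=True)
    cats_b, ib = np.unique(b, return_inverse=True)
    if len(cats_a) < 2 or len(cats_b) < 2:
        return 0.0  # a constant attribute carries no association
    table = np.zeros((len(cats_a), len(cats_b)))
    np.add.at(table, (ia, ib), 1)  # build the contingency table
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))


def impute_weighted_knn(X, k=3):
    """Fill each missing cell by majority vote among the k nearest donor rows.

    Distances count mismatches over attributes observed in both rows, each
    mismatch weighted by that attribute's association with the target attribute.
    """
    X = np.asarray(X)
    out = X.copy()
    n_rows, n_cols = X.shape
    for i, j in zip(*np.where(X == MISSING)):
        donors = np.where(X[:, j] != MISSING)[0]
        if donors.size == 0:
            continue  # nothing to borrow from; leave the cell missing
        # Weight of attribute l = its association with the target attribute j.
        w = np.zeros(n_cols)
        for l in range(n_cols):
            if l == j:
                continue
            both = (X[:, j] != MISSING) & (X[:, l] != MISSING)
            if both.sum() > 1:
                w[l] = cramers_v(X[both, j], X[both, l])
        dists = np.full(donors.size, np.inf)
        for pos, d in enumerate(donors):
            usable = (X[i] != MISSING) & (X[d] != MISSING) & (w > 0)
            if usable.any():
                mismatch = (X[i, usable] != X[d, usable]).astype(float)
                dists[pos] = np.dot(w[usable], mismatch) / w[usable].sum()
        nearest = donors[np.argsort(dists, kind="stable")[:k]]
        out[i, j] = Counter(X[nearest, j].tolist()).most_common(1)[0][0]
    return out
```

In this sketch an attribute that is strongly associated with the incomplete attribute dominates the distance, while a weakly associated one contributes little, which is the intuition behind using association information when selecting neighbors.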

Full Text