A Hybrid Modified Deep Learning Data Imputation Method for Numeric Datasets

Nuran Peker, Cemalettin Kubat

doi:10.18201/ijisae.2021167931

Abstract

Missing data is a major problem for both machine learning and data mining methods. Most of these methods cannot handle missing values at all, and the performance of those that can is often degraded. Imputation is a data preprocessing method that replaces missing data with appropriate values. This study aims to develop a hybrid modified imputation method based on a deep learning approach. For this purpose, we combine Random Forest with Datawig deep learning imputation and call the resulting method RF-DLI. Datawig is a deep learning-based library that supports missing value imputation for all types of data. The RF-DLI approach imputes missing data in three steps. First, the importance of each attribute of the dataset is determined with Random Forest (RF). Second, the most important 50% of the attributes are selected. Finally, each missing value is imputed with Datawig (DLI) using these most important attributes. The study uses six real-world datasets from different fields, each with 30% missing data. The imputation performance of RF-DLI is compared with that of the KNN, MICE, and MEAN imputation approaches in terms of the MAE, RMSE, and R² evaluation metrics. The results show that, in most cases, RF-DLI achieves better imputation performance than the other techniques.
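The evaluation setup described in the abstract can be illustrated with a minimal sketch that uses only the Python standard library. It hides 30% of the values in a toy numeric column, fills them with the MEAN baseline, and scores the result with MAE, RMSE, and R². Several details here are assumptions rather than the authors' procedure: the function names (`mask_values`, `mean_impute`), the toy data, the uniformly random choice of which values to hide, and the decision to score only the hidden entries. RF-DLI itself is not implemented here.

```python
import math
import random


def mae(y_true, y_pred):
    """Mean absolute error between true and imputed values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and imputed values."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination (undefined if y_true is constant)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mask_values(column, fraction, rng):
    """Hide a fraction of the values (assumption: chosen uniformly at random)."""
    n_missing = round(len(column) * fraction)
    missing_idx = sorted(rng.sample(range(len(column)), n_missing))
    masked = list(column)
    for i in missing_idx:
        masked[i] = None  # None marks a missing value
    return masked, missing_idx


def mean_impute(masked):
    """MEAN baseline: fill every missing value with the mean of observed ones."""
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]


rng = random.Random(0)                      # fixed seed for repeatability
column = [float(x) for x in range(1, 21)]   # hypothetical toy data, 20 values
masked, missing_idx = mask_values(column, 0.30, rng)
imputed = mean_impute(masked)

# Score only the entries that were hidden (an assumption of this sketch).
y_true = [column[i] for i in missing_idx]
y_pred = [imputed[i] for i in missing_idx]
print("MAE :", mae(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("R2  :", r2(y_true, y_pred))
```

Because mean imputation predicts a single constant, its R² on the hidden entries cannot exceed zero, which makes it a natural lower reference point when comparing methods such as KNN, MICE, or RF-DLI under the same masking.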
