Abstract

In most scientific studies that involve data analysis, missing data are a critical problem, and selecting an appropriate approach for handling them is a challenge. In this paper, the authors perform a fair comparative study of several practical imputation methods for handling missing values against two proposed imputation algorithms. The proposed algorithms are based on the Bayesian Ridge technique under two different feature selection conditions. They differ from existing approaches in that they accumulate the imputed features: each imputed feature is incorporated into the Bayesian Ridge equation used to predict the missing values in the next selected incomplete feature. The authors applied the proposed algorithms to eight datasets with different amounts of missing values generated by different missingness mechanisms. Performance was measured in terms of imputation time, root-mean-square error (RMSE), coefficient of determination (R2), and mean absolute error (MAE). The results showed that performance varies with the percentage of missing values, the size of the dataset, and the missingness mechanism. In addition, the performance of the proposed methods is slightly better than that of the compared methods.
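The cumulative idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not spell out the two feature selection conditions, so the ordering rule used here (visit incomplete columns from fewest to most missing values) is an assumption, as are the function name `cumulative_bayesian_ridge_impute` and the synthetic data. The sketch relies on scikit-learn's `BayesianRidge` and assumes at least one column is fully observed.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge


def cumulative_bayesian_ridge_impute(X):
    """Impute NaNs column by column, growing the predictor set as columns are completed."""
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X)
    n_missing = missing.sum(axis=0)

    # Predictors start as the fully observed columns.
    predictors = [j for j in range(X.shape[1]) if n_missing[j] == 0]
    if not predictors:
        raise ValueError("this sketch assumes at least one complete column")

    # Assumed selection rule: visit incomplete columns from fewest to most NaNs.
    incomplete = sorted(
        (j for j in range(X.shape[1]) if n_missing[j] > 0),
        key=lambda j: n_missing[j],
    )

    for j in incomplete:
        rows_missing = missing[:, j]
        model = BayesianRidge()
        # Fit on rows where column j is observed, using the current predictor set.
        model.fit(X[~rows_missing][:, predictors], X[~rows_missing, j])
        # Fill the missing cells of column j with the model's predictions.
        X[rows_missing, j] = model.predict(X[rows_missing][:, predictors])
        # The completed column joins the predictors for the next incomplete column.
        predictors.append(j)
    return X


# Synthetic demo data (hypothetical, not taken from the paper).
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X_true = np.column_stack([
    base,
    2.0 * base + 0.1 * rng.normal(size=200),
    -base + 0.1 * rng.normal(size=200),
])
X_miss = X_true.copy()
X_miss[rng.random(200) < 0.1, 1] = np.nan
X_miss[rng.random(200) < 0.2, 2] = np.nan

X_imputed = cumulative_bayesian_ridge_impute(X_miss)
mask = np.isnan(X_miss)
rmse = np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2))
print(f"RMSE on imputed cells: {rmse:.3f}")
```

Because each completed column is appended to `predictors`, later regressions see more features than earlier ones, which is the accumulation the abstract refers to. Observed values are never overwritten; only cells that were NaN are filled.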

Highlights

  • Data that contain missing values are considered one of the main problems that prevent building an efficient model

  • A log scale is used in the root-mean-square error (RMSE), mean absolute error (MAE), and imputation time comparisons because each metric has a different range of values

  • For the RMSE, MAE, and imputation time metrics, a lower value is better, so they are shown together in the same figure
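As a small worked example of the metrics named above, the following sketch computes RMSE, MAE, and R2 from their standard definitions and then takes base-10 logarithms of the lower-is-better values, which is one way to place quantities with different ranges on a shared axis. The input numbers and the imputation time are made up for illustration, and the helper names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error: square root of the mean squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up true values and imputed values for four originally missing cells.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

scores = {
    "RMSE": rmse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
    "time_s": 0.02,  # hypothetical imputation time in seconds
}
# Log-transform the lower-is-better metrics so they can share one axis.
log_scores = {name: math.log10(value) for name, value in scores.items()}

print(scores, r2(y_true, y_pred))
print(log_scores)
```

R2 is left out of the log transform here because it is a higher-is-better score and can be zero or negative, where a logarithm is undefined.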

Summary

Introduction

Data that contain missing values are considered one of the main problems that prevent building an efficient model. The amount of missing data affects model performance and produces biased estimates, leading to unacceptable results [1]. The following subsections discuss the types of missingness in data and the methods for handling it. Detecting the source of "missingness" is vital, as it affects the selection of the imputation method. Missing data occur in the medical field when: (i) the variable was measured, but for an unknown reason the values were not recorded electronically, e.g., sensor loss, errors in connecting to the database server, unintentional human forgetfulness, power failure, and others; (ii) the variable was not measured over a period of time because of a …

Symmetry 2020, 12, 1594; doi:10.3390/sym12101594 www.mdpi.com/journal/symmetry
