Review on data imputation methods in machine learning

Jianing Xue

doi:10.1088/1742-6596/2646/1/012034

Abstract

Data is an important element in the analysis of machine learning. It is usually measured based on observations and is also an indispensable element in training a model. Good preparation of data helps enhance the performance of analysis and is able to deliver reliable final results. However, lots of factors influence the dataset and some lead to the loss of some data. When some portion of the data is missing, it causes biases in the final prediction outcomes. In order to minimize the consequences of missing data, several data imputation methods are established to solve the problem. This paper will first talk about some basic concepts about missing data. In the following sections, the paper will present several popular data imputation methods, including complete case analysis, single imputation, and multiple imputations. Applications of some methods will be presented to see how they can be used in real analysis situations. Finally, the paper will talk about the limits of current data imputation methods.

Full Text