We study the problem of missing data imputation, which is a fundamental task in the area of data quality that aims to impute the missing data to achieve the completeness of datasets. Though the recent distribution-modeling-based techniques (e.g., distribution generation and distribution matching) can achieve state-of-the-art performance in terms of imputation accuracy, we notice that (1) they deploy a sophisticated deep learning model that tends to be overfitting for missing data imputation; (2) they directly rely on a global data distribution while overlooking the local information. Driven by the inherent variability in both missing data and missing mechanisms, in this paper, we explore the uncertain nature of this task and aim to address the limitations of existing works by proposing an u<u>N</u>certainty-driven netw<u>O</u>rk for <u>M</u>issing data <u>I</u>mputation, termed NOMI. NOMI has three key components, i.e., the retrieval module, the neural network gaussian process imputator (NNGPI) and the uncertainty-based calibration module. NOMI~ runs these components sequentially and in an iterative manner to achieve a better imputation performance. Specifically, in the retrieval module, NOMI~ retrieves local neighbors of the incomplete data samples based on the pre-defined similarity metric. Subsequently, we design NNGPI~ that merges the advantages of both the Gaussian Process and the universal approximation capacity of neural networks. NNGPI~ models the uncertainty by learning the posterior distribution over the data to impute missing values while alleviating the overfitting issue. Moreover, we further propose an uncertainty-based calibration module that utilizes the uncertainty of the imputator on its prediction to help the retrieval module obtain more reliable local information, thereby further enhancing the imputation performance. We also demonstrate that our NOMI~ can be reformulated as an instance of the well-known Expectation Maximization (EM) algorithm, highlighting the strong theoretical foundation of our proposed methods. Extensive experiments are conducted over 12 real-world datasets. The results demonstrate the excellent performance of NOMI in terms of both accuracy and efficiency.
Read full abstract