Abstract

We study missing data imputation, a fundamental data-quality task that fills in missing values to make datasets complete. Although recent distribution-modeling techniques (e.g., distribution generation and distribution matching) achieve state-of-the-art imputation accuracy, we observe that (1) they deploy sophisticated deep learning models that tend to overfit on the imputation task, and (2) they rely directly on the global data distribution while overlooking local information. Motivated by the inherent variability in both missing data and missing mechanisms, in this paper we explore the uncertain nature of this task and address these limitations by proposing an uNcertainty-driven netwOrk for Missing data Imputation, termed NOMI. NOMI has three key components: the retrieval module, the neural network Gaussian process imputator (NNGPI), and the uncertainty-based calibration module. NOMI runs these components sequentially and iteratively to improve imputation performance. Specifically, in the retrieval module, NOMI retrieves local neighbors of the incomplete data samples based on a pre-defined similarity metric. Next, we design NNGPI, which combines the advantages of Gaussian processes with the universal approximation capacity of neural networks. NNGPI models uncertainty by learning a posterior distribution over the data, which lets it impute missing values while alleviating overfitting. Finally, we propose an uncertainty-based calibration module that uses the imputator's uncertainty about its predictions to help the retrieval module obtain more reliable local information, further enhancing imputation performance.
We also show that NOMI can be reformulated as an instance of the well-known Expectation-Maximization (EM) algorithm, which gives the proposed method a strong theoretical foundation. Extensive experiments on 12 real-world datasets demonstrate the excellent performance of NOMI in terms of both accuracy and efficiency.
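
To make the retrieve, impute, calibrate loop described above more concrete, the following is a minimal sketch and not the authors' implementation. It makes several loud assumptions: a plain RBF-kernel Gaussian process stands in for NNGPI, Euclidean distance stands in for the pre-defined similarity metric, features are assumed to be roughly standardized, and the calibration step is a hypothetical rule that blends the new prediction with the previous estimate according to the posterior variance. The function names `rbf`, `gp_posterior`, and `impute` and all parameters are illustrative only.

```python
import numpy as np

def rbf(A, B, length_scale=1.0):
    """RBF kernel matrix between rows of A and rows of B (stand-in kernel, an assumption)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(X_tr, y_tr, x_te, noise=1e-2):
    """Posterior mean and variance of standard GP regression at a single test point."""
    K = rbf(X_tr, X_tr) + noise * np.eye(len(X_tr))
    k_s = rbf(x_te[None, :], X_tr)                      # shape (1, n_train)
    offset = y_tr.mean()                                # center targets on the neighbor mean
    mean = offset + k_s @ np.linalg.solve(K, y_tr - offset)
    var = 1.0 - k_s @ np.linalg.solve(K, k_s.T)         # prior variance of the RBF kernel is 1
    return float(mean[0]), float(np.clip(var[0, 0], 0.0, 1.0))

def impute(X, k=5, n_iters=3):
    """Iteratively impute NaN entries of X using local neighbors and a GP posterior."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)  # initial guess: column means
    for _ in range(n_iters):
        for i, j in zip(*np.nonzero(missing)):
            others = np.delete(np.arange(X.shape[1]), j)
            donors = np.nonzero(~missing[:, j])[0]        # rows where column j is observed
            # Retrieval: k nearest donors, measured on the other (currently filled) columns.
            dist = np.linalg.norm(filled[donors][:, others] - filled[i, others], axis=1)
            nn = donors[np.argsort(dist)[:k]]
            # Imputation: GP posterior fitted on the retrieved neighbors only.
            mean, var = gp_posterior(filled[nn][:, others], filled[nn, j], filled[i, others])
            # Calibration (hypothetical rule): trust the new prediction less when variance is high.
            weight = 1.0 - var
            filled[i, j] = weight * mean + (1.0 - weight) * filled[i, j]
    return filled
```

Calling `impute(X)` on a 2-D array containing `np.nan` entries returns a completed copy in which observed entries are left unchanged. Because each later pass retrieves neighbors from the updated matrix, low-uncertainty imputations influence subsequent retrieval more strongly, which is the intuition behind the calibration module in this sketch.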
