Areas Of Data Quality Research Articles

We study the problem of missing data imputation, a fundamental data quality task that aims to fill in missing values so that datasets become complete. Although recent distribution-modeling-based techniques (e.g., distribution generation and distribution matching) achieve state-of-the-art imputation accuracy, we notice that (1) they deploy sophisticated deep learning models that tend to overfit on the missing data imputation task; and (2) they rely directly on a global data distribution while overlooking local information. Driven by the inherent variability in both missing data and missing mechanisms, in this paper we explore the uncertain nature of this task and aim to address the limitations of existing works by proposing an uNcertainty-driven netwOrk for Missing data Imputation, termed NOMI. NOMI has three key components: the retrieval module, the neural network Gaussian process imputator (NNGPI), and the uncertainty-based calibration module. NOMI runs these components sequentially and iteratively to improve imputation performance. Specifically, in the retrieval module, NOMI retrieves local neighbors of the incomplete data samples based on a pre-defined similarity metric. Subsequently, we design NNGPI, which combines the advantages of the Gaussian process with the universal approximation capacity of neural networks. NNGPI models uncertainty by learning the posterior distribution over the data to impute missing values while alleviating the overfitting issue. Moreover, we propose an uncertainty-based calibration module that uses the imputator's uncertainty about its predictions to help the retrieval module obtain more reliable local information, thereby further enhancing imputation performance. We also demonstrate that NOMI can be reformulated as an instance of the well-known Expectation Maximization (EM) algorithm, highlighting the strong theoretical foundation of the proposed method. Extensive experiments are conducted over 12 real-world datasets. The results demonstrate the excellent performance of NOMI in terms of both accuracy and efficiency.
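
The retrieve, impute, calibrate loop described in this abstract can be illustrated with a small sketch. This is not the NOMI implementation: the neural network Gaussian process is replaced here by a plain distance-weighted neighbor average, and the weighted variance of the neighbor values stands in for posterior uncertainty. The function name `impute`, its parameters `k` and `n_iters`, and the weighting formula are all assumptions made for illustration. The sketch also assumes at least two rows and at least one observed value in every column.

```python
import math

def impute(data, k=3, n_iters=3):
    """Toy retrieve -> impute -> calibrate loop; `None` marks a missing cell."""
    n_rows, n_cols = len(data), len(data[0])
    missing = [[v is None for v in row] for row in data]

    # Initial fill: column means of the observed entries.
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in data if row[j] is not None]
        col_means.append(sum(observed) / len(observed))
    filled = [[col_means[j] if missing[i][j] else data[i][j]
               for j in range(n_cols)] for i in range(n_rows)]
    # Observed cells get uncertainty 0; imputed cells start at an arbitrary 1.0.
    uncertainty = [[1.0 if missing[i][j] else 0.0
                    for j in range(n_cols)] for i in range(n_rows)]

    for _ in range(n_iters):
        new_filled = [row[:] for row in filled]
        new_unc = [row[:] for row in uncertainty]
        for i in range(n_rows):
            for j in range(n_cols):
                if not missing[i][j]:
                    continue
                # Retrieval: k nearest rows, measured on every column except j.
                scored = []
                for r in range(n_rows):
                    if r == i:
                        continue
                    dist = math.sqrt(sum((filled[i][c] - filled[r][c]) ** 2
                                         for c in range(n_cols) if c != j))
                    scored.append((dist, r))
                scored.sort()
                neighbours = scored[:k]
                # Calibration stand-in: down-weight neighbours whose value in
                # column j is itself an uncertain imputation.
                weights = [1.0 / (1.0 + d) / (1.0 + uncertainty[r][j])
                           for d, r in neighbours]
                values = [filled[r][j] for _, r in neighbours]
                total = sum(weights)
                # Imputation stand-in: weighted mean, with the weighted
                # variance kept as the uncertainty of this prediction.
                mean = sum(w * v for w, v in zip(weights, values)) / total
                var = sum(w * (v - mean) ** 2
                          for w, v in zip(weights, values)) / total
                new_filled[i][j] = mean
                new_unc[i][j] = var
        filled, uncertainty = new_filled, new_unc
    return filled, uncertainty
```

Calling `impute(rows, k=3)` on a list of equal-length rows returns the completed rows together with a per-cell uncertainty table. Observed cells are never modified, and each pass reuses the previous pass's uncertainties when weighting neighbors, which mirrors the iterative structure the abstract describes.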

Incompleteness management has become a popular research topic and arises in many applications in data quality and data management. Traditional methods for handling incompleteness assume that data is either totally complete or incomplete. In practical applications, however, data is often partially complete: the data as a whole is not complete, but certain parts of it that satisfy given semantic specifications are. Intuitively, partially complete data can still give complete answers to queries consistent with the semantic specifications. It is therefore important to study the fundamental problems of managing partially complete data. However, to the best of our knowledge, only a few works focus on this area. This paper studies the most important and fundamental problem, completeness reasoning, from the perspective of parameterized complexity. The completeness reasoning problem, TC-QC (Table Completeness to Query Completeness), was first formally defined and studied by Razniewski et al. [1]. Given completeness statements about the data, the goal of the TC-QC problem is to determine whether the result of a specific query Q is complete, that is, to reason about query completeness based on the given data completeness. Razniewski et al. have shown that the TC-QC problem is NP-hard even for conjunctive queries, so a natural and interesting question is whether TC-QC can be solved efficiently by parameterized algorithms. To answer that, the paper studies the parameterized complexity of completeness reasoning for conjunctive queries. First, it is shown that, under the parameterizations defined by the size of the query completeness or the table completeness, the parameterized TC-QC problem for conjunctive queries is para-NP-complete, which strongly indicates that the TC-QC problems parameterized by these two parameters do not admit fixed-parameter tractable algorithms. Then, for more special cases parameterized by different constraints on the query structure, such as degree, tree-width, and number of variables, the TC-QC problems are still not fixed-parameter tractable. Finally, on the positive side, if each data completeness statement has a constant size bound, the parameterized TC-QC problem defined by the query completeness size can be solved by a fixed-parameter tractable algorithm.

Articles published on Areas Of Data Quality

Missing Data Imputation with Uncertainty-Driven Network

Overcoming diagnostic challenges of artificial intelligence in pathology and radiology: Innovative solutions and strategies

Parameterized complexity of completeness reasoning for conjunctive queries

Protocol Update for large-scale genome and gene function analysis with the PANTHER classification system (v.14.0).

Completeness and Accuracy of Local Clinical Registry Data for Children Undergoing Heart Surgery

Community Engagement among the BioSense 2.0 User Group

Statistical challenges in systematic evidence generation through analysis of observational healthcare data networks

MULTIPLE REMOVAL SUCCESS IN THE CARNARVON BASIN WITH SRME

Report on the Dagstuhl Seminar
