Identifying insufficient data coverage in databases with multiple relations

Yin Lin, Abolfazl Asudeh, H. V. Jagadish, Yifan Guan

doi:10.14778/3407790.3407821

Abstract

In today's data-driven world, it is critical that we use appropriate datasets for analysis and decision-making. Datasets can be biased because they reflect existing inequalities in the world, because of the data scientists' biased world view, or because of the data scientists' limited control over the data collection process. For these reasons, it is important to ensure adequate data coverage across different groups over the intersection of multiple attributes. Often, the dataset to be analyzed is obtained through complex joins and predicate combinations over multiple relational tables in a database. Due to the sheer data volume we often have to deal with, determining adequate coverage can require an unacceptably long execution time. In this paper, we provide an efficient approach for coverage analysis, given a set of attributes across multiple tables. To identify regions with insufficient coverage in the combinatorially large set of value combinations, we design an index scheme to avoid explicit table joins, achieve efficient memory usage, and support predicate combination at a high level of parallelism. We also propose P-WALK, a priority-based search algorithm, to traverse the lattice space. Since, in practice, coverage assessment typically does not require precise COUNT aggregation results, we further present approximate methods to reduce computation time. Experimental evaluation using three real-world datasets shows the effectiveness, efficiency, and accuracy of the proposed methods.
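
To make the notion of insufficient coverage concrete, the following is a minimal sketch of a naive baseline: it counts the rows for every value combination over a chosen set of attributes and reports the combinations whose count falls below a threshold. It is not the index scheme or the P-WALK algorithm from the paper. It assumes the join has already been materialized as a list of rows, which is exactly the cost the paper aims to avoid. The function name, the toy data, and the threshold are hypothetical.

```python
from collections import Counter
from itertools import product


def uncovered_combinations(rows, attributes, domains, threshold):
    """Return value combinations over `attributes` with fewer than `threshold` rows.

    Naive baseline (an assumption for illustration, not the paper's method):
    enumerate the full cross product of attribute domains and compare each
    combination's exact COUNT against the threshold.
    """
    # Exact COUNT per observed combination of attribute values.
    counts = Counter(tuple(row[a] for a in attributes) for row in rows)
    # Enumerate every possible combination, including those never observed.
    all_combos = product(*(domains[a] for a in attributes))
    return [combo for combo in all_combos if counts[combo] < threshold]


# Hypothetical toy data standing in for an already-joined result.
rows = [
    {"gender": "F", "race": "A"},
    {"gender": "F", "race": "A"},
    {"gender": "M", "race": "A"},
    {"gender": "M", "race": "B"},
    {"gender": "M", "race": "B"},
]
domains = {"gender": ["F", "M"], "race": ["A", "B"]}

result = uncovered_combinations(rows, ["gender", "race"], domains, threshold=2)
print(result)
```

The number of combinations enumerated here grows as the product of the domain sizes, which illustrates why the abstract calls the search space combinatorially large and why a smarter traversal and approximate counting are attractive.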

Journal: Proceedings of the VLDB Endowment
Publication Date: Aug 1, 2020
Citations: 30