Abstract

In today's data-driven world, it is critical that we use appropriate datasets for analysis and decision-making. Datasets could be biased because they reflect existing inequalities in the world, due to the data scientists' biased world view, or due to the data scientists' limited control over the data collection process. For these reasons, it is important to ensure adequate data coverage across different groups over the intersection of multiple attributes. Often, the dataset to be analyzed is obtained through complex joins and predicate combinations over multiple relational tables in a database. Due to the sheer data volume we often have to deal with, determining adequate coverage can require an unacceptably long execution time. In this paper, we provide an efficient approach for coverage analysis, given a set of attributes across multiple tables. To identify regions with insufficient coverage in the combinatorially large set of value combinations, we design an index scheme to avoid explicit table joins, achieve efficient memory usage, and support predicate combination at a high level of parallelism. We also propose P-WALK , a priority-based search algorithm, to traverse the lattice space. Since in practice, coverage assessment typically does not require precise COUNT aggregation results, we further present approximate methods to reduce computation time. Experimental evaluation using three real-world datasets shows the effectiveness, efficiency, and accuracy of proposed methods.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call