Abstract

Conditional functional dependencies (CFDs) generalize functional dependencies and are widely applicable. The CFDs that hold on a dataset are usually unknown and too costly to design manually, so CFD discovery methods have been studied to find hidden CFDs in data. In data cleaning, only a small number of CFDs are used to detect and repair errors, yet common CFD discovery methods find all CFDs that hold (approximately) on the data, and an expensive post-processing step is then required to select the relevant ones. In this paper, we present an approach to discovering CFDs that can detect errors in data, guided by a small set of erroneous attribute values labeled by users. (1) We present a method that works iteratively and consists of several modules: data sampling, CFD discovery and refinement guided by user labeling, and data re-sampling guided by the discovered CFDs. (2) We present novel, efficient techniques that support our approach, aimed at identifying CFDs useful for cleaning and at reducing user interactions. (3) We conduct extensive experimental evaluations to verify our approach against state-of-the-art CFD discovery algorithms, with or without user interactions.
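
To make the notion of a CFD detecting errors concrete, the following is a minimal sketch of violation checking for a single CFD under the standard definition (an embedded dependency X → A plus a pattern tuple of constants and wildcards). It is not the discovery method of this paper. The function names, the `_` wildcard symbol, and the toy address data are all illustrative assumptions.

```python
from itertools import combinations

WILDCARD = "_"  # assumed symbol for an unnamed variable in the pattern tuple


def matches(value, pattern):
    """A value matches a pattern cell if the cell is a wildcard or an equal constant."""
    return pattern == WILDCARD or value == pattern


def cfd_violations(rows, lhs, rhs, pattern):
    """Return sorted indices of rows involved in a violation of the CFD (lhs -> rhs, pattern)."""
    # Only rows whose LHS values match the pattern's LHS cells fall in the CFD's scope.
    scope = [i for i, r in enumerate(rows)
             if all(matches(r[a], pattern[a]) for a in lhs)]
    bad = set()
    # Single-tuple violations: the RHS value does not match the RHS pattern cell.
    for i in scope:
        if not matches(rows[i][rhs], pattern[rhs]):
            bad.add(i)
    # Pair violations: two in-scope rows agree on the LHS but differ on the RHS.
    for i, j in combinations(scope, 2):
        same_lhs = all(rows[i][a] == rows[j][a] for a in lhs)
        if same_lhs and rows[i][rhs] != rows[j][rhs]:
            bad.update((i, j))
    return sorted(bad)


# Hypothetical toy data: within the UK, the zip code is assumed to determine the street.
rows = [
    {"country": "UK", "zip": "EH4", "street": "Mayfield"},
    {"country": "UK", "zip": "EH4", "street": "Crichton"},
    {"country": "US", "zip": "EH4", "street": "Main"},
    {"country": "UK", "zip": "W1B", "street": "Regent"},
]
pattern = {"country": "UK", "zip": "_", "street": "_"}
violations = cfd_violations(rows, ["country", "zip"], "street", pattern)
```

Here rows 0 and 1 are flagged because they share a UK zip code but disagree on the street, while the US row falls outside the condition and is ignored. The quadratic pair scan keeps the sketch short; grouping in-scope rows by their LHS values would scale better.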
