Abstract

We propose the first framework for discovering the set of meaningful functional dependencies from data. This set contains the true positives among the functional dependencies that hold on the given data. Based on new data structures and original techniques for the dynamic computation of stripped partitions, we devise a new hybridization strategy that results in the first algorithm that can explore trade-offs between runtime efficiency and main memory usage. Using real-world benchmark data, we demonstrate that our algorithm outperforms the previous state of the art in runtime efficiency and in scalability in the number of rows and columns. We propose the number of redundant data values as a measure for ranking the discovered functional dependencies. Our ranking helps separate false from true positives for applications such as schema design. The remaining meaningful functional dependencies are the false negatives, that is, those functional dependencies that are violated by the given data only because of data inconsistency. We propose the computation of informative Armstrong relations to draw the attention of users to violations of functional dependencies that are meaningful for some application. We order the pairs of records in Armstrong relations by the amount of inconsistency and redundancy caused by the associated functional dependencies, thereby directing attention to those most likely to be meaningful. As we demonstrate, these samples help separate false from true negatives; their perfect recall of meaningful functional dependencies can lead to a more complete acquisition of requirements and identification of dirty data; and they may be computed faster than covers of functional dependencies. In addition, we demonstrate for the first time that non-redundant covers can offer a representation of functional dependencies that is much smaller than the left-hand side reduced covers used in previous work. Such a compact representation of the output is easier for humans to understand and explore. We report all our results for different interpretations of missing values.
