Functional dependencies (FDs) are among the most important integrity constraints in databases. They serve to normalize datasets and thus resolve redundancies, they contribute to query optimization, and they are frequently used to guide data cleaning efforts. Because the FDs of a particular dataset are usually unknown, automatic profiling algorithms are needed to discover them. These algorithms have made considerable advances in the past few years, but they still require a significant amount of time and memory to process datasets of practically relevant sizes. We present FDHits, a novel FD discovery algorithm that finds all valid, minimal FDs in a given relational dataset. FDHits is based on several discovery optimizations that include a hybrid validation approach, effective hitting set enumeration techniques, one-pass candidate validations, and parallelization. Our experiments show that FDHits, even without parallel execution, has a median speedup of 8.1 compared to state-of-the-art FD discovery algorithms while using significantly less memory. This allows the discovery of all FDs even on datasets that could not be processed by the current state-of-the-art.
Read full abstract