Abstract

Functional dependencies (FDs) and candidate keys are essential for table decomposition, database normalization, and data cleansing. In this paper, we present FDTool, a command-line Python application to discover minimal FDs in tabular datasets and infer equivalent attribute sets and candidate keys from them. The runtime and memory costs associated with seven published FD discovery algorithms are given with an overview of their theoretical foundations. Previous research establishes that FD_Mine is the most efficient FD discovery algorithm when applied to datasets with many rows (> 100,000) and few columns (< 14). This makes it particularly well suited to mining rules from clinical and demographic datasets, which often consist of long, narrow sets of participant records. The structure of FD_Mine is described and supplemented with a formal proof of the equivalence pruning method it uses. FDTool is a re-implementation of FD_Mine with additional features added to improve performance and automate typical processes in database architecture. The experimental results of applying FDTool to 13 datasets of different dimensions are summarized in terms of the number of FDs checked, the number of FDs found, and the time it takes for the code to terminate. We find that the number of attributes in a dataset has a much greater effect on the runtime and memory costs of FDTool than the row count does. The last section explains in detail how the FDTool application can be accessed, executed, and further developed.

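To make the inference step concrete, here is a minimal Python sketch, not FDTool's actual code (all names are illustrative), of how candidate keys can be derived from a set of already-discovered FDs by computing attribute-set closures under Armstrong's Axioms:

    from itertools import combinations

    def closure(attrs, fds):
        """Return the closure of `attrs` under `fds` (Armstrong's Axioms)."""
        closed = set(attrs)
        changed = True
        while changed:
            changed = False
            for lhs, rhs in fds:
                if set(lhs) <= closed and not set(rhs) <= closed:
                    closed |= set(rhs)
                    changed = True
        return closed

    def candidate_keys(schema, fds):
        """Enumerate minimal attribute sets whose closure covers the schema."""
        keys = []
        for size in range(1, len(schema) + 1):
            for combo in combinations(sorted(schema), size):
                # Any superset of a known key is a superkey, never minimal.
                if any(set(key) <= set(combo) for key in keys):
                    continue
                if closure(combo, fds) >= set(schema):
                    keys.append(combo)
        return keys

    # With A -> B and B -> C, {A} is the only candidate key of {A, B, C}.
    print(candidate_keys({"A", "B", "C"}, [(("A",), ("B",)), (("B",), ("C",))]))

Because every superset of a key determines the whole schema, the minimality test reduces to skipping supersets of keys already found.
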
Highlights

  • Functional dependencies (FDs) are key to understanding how attributes in a database schema relate to one another

  • An FD X → Y asserts that the values of attribute set X uniquely determine those of attribute set Y (Yao et al., 2002); a small check of this definition against data is sketched after this list

  • FDTool provides the user with the minimal FDs, equivalent attribute sets and candidate keys mined from a dataset

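As a concrete illustration of the second highlight, the following minimal sketch checks whether an FD X → Y holds in a table, assuming the data has been loaded into a pandas DataFrame; the fd_holds helper and the sample data are illustrative, not part of FDTool's interface:

    import pandas as pd

    def fd_holds(df, X, Y):
        """True if the values of attribute set X uniquely determine those of Y.

        X -> Y holds when no distinct value combination of X co-occurs with
        more than one value combination of Y (null handling simplified here).
        """
        return df.groupby(X)[Y].nunique().max().max() <= 1

    df = pd.DataFrame({"zip":  [10001, 10001, 19104],
                       "city": ["NYC", "NYC", "Philadelphia"],
                       "id":   [1, 2, 3]})
    print(fd_holds(df, ["zip"], ["city"]))  # True: each zip has one city
    print(fd_holds(df, ["city"], ["id"]))   # False: NYC pairs with ids 1 and 2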

Introduction

Functional dependencies (FDs) are key to understanding how attributes in a database schema relate to one another. Lattice traversal algorithms are the most effective on datasets with many rows, because their validation method operates on attribute sets as opposed to the data itself (Papenbrock et al., 2015). Pruning rules check the validity of candidates not yet visited against FDs already discovered and FDs inferred from them through Armstrong’s Axioms (Yao & Hamilton, 2008).

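The following minimal sketch shows that pruning idea, using the same closure computation as in the sketch after the abstract (again illustrative, not FD_Mine's actual code): a candidate X → A only needs to be validated against the rows if it does not already follow from the discovered FDs.

    def closure(attrs, fds):
        """Attribute-set closure under Armstrong's Axioms (as sketched above)."""
        closed = set(attrs)
        changed = True
        while changed:
            changed = False
            for lhs, rhs in fds:
                if set(lhs) <= closed and not set(rhs) <= closed:
                    closed |= set(rhs)
                    changed = True
        return closed

    def needs_data_check(X, A, discovered):
        """False when X -> A is derivable from `discovered` via Armstrong's
        Axioms, so the (expensive) row-level validation can be skipped."""
        return A not in closure(X, discovered)

    # With A -> B and B -> C discovered, A -> C follows by transitivity,
    # so this candidate never has to be validated against the data.
    fds = [(("A",), ("B",)), (("B",), ("C",))]
    print(needs_data_check(("A",), "C", fds))  # False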