Abstract

Functional dependencies (FDs) and candidate keys are essential for table decomposition, database normalization, and data cleansing. In this paper, we present FDTool, a command-line Python application to discover minimal FDs in tabular datasets and infer equivalent attribute sets and candidate keys from them. The runtime and memory costs associated with seven published FD discovery algorithms are given with an overview of their theoretical foundations. Previous research establishes that FD_Mine is the most efficient FD discovery algorithm when applied to datasets with many rows (> 100,000) and few columns (< 14). This makes it particularly well suited to mining rules from clinical and demographic datasets, which often consist of long, narrow sets of participant records. The structure of FD_Mine is described and supplemented with a formal proof of the equivalence pruning method it uses. FDTool is a re-implementation of FD_Mine with additional features added to improve performance and automate typical processes in database architecture. The experimental results of applying FDTool to 13 datasets of different dimensions are summarized in terms of the number of FDs checked, the number of FDs found, and the time it takes for the code to terminate. We find that the number of attributes in a dataset has a much greater effect on the runtime and memory costs of FDTool than the row count does. The last section explains in detail how the FDTool application can be accessed, executed, and further developed.

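To make the inference step concrete, here is a minimal Python sketch, not FDTool's actual code (all names are illustrative), of how candidate keys can be derived from a set of already-discovered FDs by computing attribute-set closures under Armstrong's Axioms:

    from itertools import combinations

    def closure(attrs, fds):
        """Return the closure of `attrs` under `fds` (Armstrong's Axioms)."""
        closed = set(attrs)
        changed = True
        while changed:
            changed = False
            for lhs, rhs in fds:
                if set(lhs) <= closed and not set(rhs) <= closed:
                    closed |= set(rhs)
                    changed = True
        return closed

    def candidate_keys(schema, fds):
        """Enumerate minimal attribute sets whose closure covers the schema."""
        keys = []
        for size in range(1, len(schema) + 1):
            for combo in combinations(sorted(schema), size):
                # Any superset of a known key is a superkey, never minimal.
                if any(set(key) <= set(combo) for key in keys):
                    continue
                if closure(combo, fds) >= set(schema):
                    keys.append(combo)
        return keys

    # With A -> B and B -> C, {A} is the only candidate key of {A, B, C}.
    print(candidate_keys({"A", "B", "C"}, [(("A",), ("B",)), (("B",), ("C",))]))

Because every superset of a key determines the whole schema, the minimality test reduces to skipping supersets of keys already found.
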
Highlights

  • Functional dependencies (FDs) are key to understanding how attributes in a database schema relate to one another

  • An FD X → Y asserts that the values of attribute set X uniquely determine those of attribute set Y (Yao et al., 2002); a small check of this definition against data is sketched after this list

  • FDTool provides the user with the minimal FDs, equivalent attribute sets and candidate keys mined from a dataset

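As a concrete illustration of the second highlight, the following minimal sketch checks whether an FD X → Y holds in a table, assuming the data has been loaded into a pandas DataFrame; the fd_holds helper and the sample data are illustrative, not part of FDTool's interface:

    import pandas as pd

    def fd_holds(df, X, Y):
        """True if the values of attribute set X uniquely determine those of Y.

        X -> Y holds when no distinct value combination of X co-occurs with
        more than one value combination of Y (null handling simplified here).
        """
        return df.groupby(X)[Y].nunique().max().max() <= 1

    df = pd.DataFrame({"zip":  [10001, 10001, 19104],
                       "city": ["NYC", "NYC", "Philadelphia"],
                       "id":   [1, 2, 3]})
    print(fd_holds(df, ["zip"], ["city"]))  # True: each zip has one city
    print(fd_holds(df, ["city"], ["id"]))   # False: NYC pairs with ids 1 and 2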

Introduction

Functional dependencies (FDs) are key to understanding how attributes in a database schema relate to one another. Lattice traversal algorithms are the most effective on datasets with many rows, because their validation method operates on attribute sets as opposed to the data itself (Papenbrock et al., 2015). Pruning rules check the validity of candidates not yet visited against FDs already discovered and FDs inferred from them through Armstrong’s Axioms (Yao & Hamilton, 2008).

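The following minimal sketch shows that pruning idea, using the same closure computation as in the sketch after the abstract (again illustrative, not FD_Mine's actual code): a candidate X → A only needs to be validated against the rows if it does not already follow from the discovered FDs.

    def closure(attrs, fds):
        """Attribute-set closure under Armstrong's Axioms (as sketched above)."""
        closed = set(attrs)
        changed = True
        while changed:
            changed = False
            for lhs, rhs in fds:
                if set(lhs) <= closed and not set(rhs) <= closed:
                    closed |= set(rhs)
                    changed = True
        return closed

    def needs_data_check(X, A, discovered):
        """False when X -> A is derivable from `discovered` via Armstrong's
        Axioms, so the (expensive) row-level validation can be skipped."""
        return A not in closure(X, discovered)

    # With A -> B and B -> C discovered, A -> C follows by transitivity,
    # so this candidate never has to be validated against the data.
    fds = [(("A",), ("B",)), (("B",), ("C",))]
    print(needs_data_check(("A",), "C", fds))  # False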