Schema Normalization Research Articles

Functional dependencies (FDs) play an important role in many data management tasks, such as schema normalization, data cleaning, and query optimization. Meanwhile, application demand for efficient FD discovery on large-scale datasets keeps growing. Unfortunately, existing single-machine FD discovery algorithms are inefficient on large-scale datasets because of their huge runtime and memory overhead. Recently, distributed data-parallel computing has become the de facto standard for large-scale data processing. However, designing an efficient distributed FD discovery algorithm is challenging. In this paper, we present SmartFD, an efficient and scalable algorithm for distributed FD discovery. First, we propose a novel attribute sorting-based algorithm framework. Next, to discover all the FDs grouped by a given attribute, we propose an efficient distributed algorithm, Attribute-centric Functional Dependency Discovery (AFDD). In AFDD, we design a Fast Sampling and Early Aggregation (FSEA) mechanism to improve the efficiency of distributed sampling, and we propose a memory-efficient index-based method for distributed FD validation. Moreover, AFDD employs an attribute-parallel method to accelerate the pruning and generation of candidate FDs. Furthermore, we propose an adaptive strategy for switching between distributed sampling and distributed validation, based on a unified time-based efficiency metric, and we employ a distributed probing-based method to make the switching strategy more accurate. Experimental results on Apache Spark show that SmartFD outperforms the state-of-the-art single-machine algorithm HyFD and the existing distributed algorithm HFDD, with speedups of 3.2×–44.9× and 2.5×–455.7×, respectively. Moreover, SmartFD achieves good row scalability and column scalability, and it has sub-linear node scalability.
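To make the notion of FD validation concrete, the following is a minimal single-machine sketch of checking whether one candidate FD holds on a table. It is not the index-based distributed method of SmartFD, whose details the abstract does not give; the function name `fd_holds` and the sample rows are illustrative assumptions.

```python
from typing import Hashable, Sequence

Row = Sequence[Hashable]


def fd_holds(rows: Sequence[Row], lhs: Sequence[int], rhs: int) -> bool:
    """Return True if the columns in `lhs` functionally determine column `rhs`.

    Hypothetical helper for illustration: it remembers the first right-hand
    value seen for every left-hand value combination and reports a violation
    as soon as a second, different right-hand value shows up.
    """
    seen: dict = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        value = row[rhs]
        # setdefault stores `value` on first sight and returns the stored one.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Made-up sample data: (name, city, country)
rows = [
    ("alice", "berlin", "DE"),
    ("bob", "paris", "FR"),
    ("carol", "berlin", "DE"),
    ("dave", "paris", "FR"),
]

print(fd_holds(rows, [1], 2))  # does city determine country?
print(fd_holds(rows, [2], 0))  # does country determine name?
```

The check makes one pass over the rows and keeps one dictionary entry per distinct left-hand value combination, which hints at why memory overhead becomes a concern on large tables.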

Functional dependencies are important metadata used for schema normalization, data cleansing, and many other tasks. The efficient discovery of functional dependencies in tables is a well-known challenge in database research, and several approaches have been proposed. Because no comprehensive comparison of these algorithms exists at this time, it is hard to choose the best algorithm for a given dataset. In this experimental paper, we describe, evaluate, and compare the seven most cited and most important algorithms, all of which solve the same problem. First, we classify the algorithms into three categories and explain their commonalities. We then describe each algorithm and its main ideas. The descriptions provide additional details where the original papers were ambiguous or incomplete. Our evaluation of careful re-implementations of all algorithms spans a broad test space that includes synthetic and real-world data. We show that each functional dependency algorithm is optimized for certain data characteristics, and we provide hints on when to choose which algorithm. In summary, however, all current approaches scale surprisingly poorly, which shows potential for future research.
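As a point of reference for the discovery problem itself, here is a naive sketch that enumerates candidate left-hand sides level by level and keeps only the minimal FDs. It is not one of the seven algorithms evaluated in the paper; the names `distinct_count` and `discover_minimal_fds` and the sample rows are assumptions made for illustration. Validation uses the standard counting criterion: X determines A exactly when projecting onto X and onto X plus A yields the same number of distinct tuples.

```python
from itertools import combinations


def distinct_count(rows, columns):
    """Number of distinct value combinations in the given columns."""
    return len({tuple(row[c] for c in columns) for row in rows})


def discover_minimal_fds(rows, num_columns):
    """Return all minimal FDs as (lhs_columns, rhs_column) pairs.

    Brute-force illustration: the number of candidates grows exponentially
    with the number of columns, so this only suits very small tables.
    """
    found = []
    for rhs in range(num_columns):
        others = [c for c in range(num_columns) if c != rhs]
        minimal_lhs = []
        # Size 0 covers the case of a constant right-hand column.
        for size in range(len(others) + 1):
            for lhs in combinations(others, size):
                # Skip supersets of an already valid, smaller left-hand side.
                if any(set(m) <= set(lhs) for m in minimal_lhs):
                    continue
                if distinct_count(rows, lhs) == distinct_count(rows, lhs + (rhs,)):
                    minimal_lhs.append(lhs)
        found.extend((lhs, rhs) for lhs in minimal_lhs)
    return found


# Made-up sample data: (name, city, country)
rows = [
    ("alice", "berlin", "DE"),
    ("bob", "paris", "FR"),
    ("carol", "berlin", "DE"),
    ("dave", "paris", "FR"),
]

for lhs, rhs in discover_minimal_fds(rows, 3):
    print(lhs, "->", rhs)
```

The exponential candidate space visible in the nested loops is the reason practical algorithms rely on pruning, sampling, or other strategies tuned to particular data characteristics.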

Articles published on Schema Normalization

Efficient and Scalable Functional Dependency Discovery on Distributed Data-Parallel Platforms

Functional dependency discovery

A virtual tutor for relational schema normalization

RDBNorma: - A semi-automated tool for relational database schema normalization up to third normal form

Schema Design and Normalization Algorithm for XML Databases Model

New Search and Navigation Techniques in the Digital Library
