FastMFDs: a fast, efficient algorithm for mining minimal functional dependencies from large-scale distributed data with Spark

Feng Cheng,Zhe Yang

doi:10.1007/s11227-018-2643-8

Abstract

Minimal functional dependency is an important relationship in the relational database. It can describe some special relationships between complex and irregular attributes in the relational database. Extracting minimal functional dependencies (MFDs) from relational databases is an important database analysis technique. However, as the data grows larger and larger in size, even the most efficient stand-alone algorithms are exponential in the number of attributes of the relations. Discovering MFDs on a single computer is hard and slow, and it can only be applied to small centralized datasets. It is challenging to discover MFDs from big data, especially large-scale distributed data. Apache Spark is a unified analytics engine for big data processing; we present a new algorithm FastMFDs based on Spark for discovering all MFDs from large-scale distributed data in parallel. FastMFDs uses both the RDD framework and the DataFrame framework to store and process distributed data. FastMFDs deletes equivalent attributes. FastMFDs also provides two-way search algorithm for searching and pruning. We experimented our algorithm on real-life datasets, and our algorithm is more efficient and faster than the existing discovering methods.

Full Text