Abstract

Existing Conditional Functional Dependency (CFD) discovery algorithms require a well-prepared training dataset, which makes them difficult to apply to large, low-quality datasets. To handle the volume of big data, we develop sampling algorithms that obtain a small but representative training set. To address the low quality of big data, we design fault-tolerant rule-discovery and conflict-resolution algorithms. We also propose a parameter-selection strategy to ensure the effectiveness of the CFD discovery algorithms. Experimental results demonstrate that our method can discover effective CFD rules on billion-tuple data within a reasonable time.
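
To make the setting concrete, here is a minimal sketch of what a single CFD rule looks like and how one tuple can violate it. The schema, pattern values, and helper function are purely illustrative (not taken from the paper), and for simplicity the check covers only constant patterns; variable patterns additionally require comparing pairs of tuples.

```python
# Illustrative sketch only: the attribute names and the pattern tableau row
# below are hypothetical. A CFD is a functional dependency (LHS -> RHS) plus
# a pattern tableau; constants restrict the rule to matching tuples, and '_'
# marks a wildcard.
cfd = {
    "lhs": ["country_code", "area_code"],
    "rhs": "city",
    "pattern": {"country_code": "44", "area_code": "131", "city": "Edinburgh"},
}

def violates(tup, cfd):
    """True if tup matches every LHS constant but disagrees with the RHS."""
    p = cfd["pattern"]
    if any(p[a] != "_" and tup[a] != p[a] for a in cfd["lhs"]):
        return False                       # pattern does not apply to this tuple
    rhs = cfd["rhs"]
    return p[rhs] != "_" and tup[rhs] != p[rhs]

t = {"country_code": "44", "area_code": "131", "city": "Glasgow"}
print(violates(t, cfd))                    # True: matching LHS, wrong city
```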

Highlights

  • With the continuing accumulation of data, databases have become increasingly large

  • Evaluation covers both the time of cleaning data with the discovered Conditional Functional Dependencies (CFDs) and the quality of the cleaned data, measured by comparing the percentage of data cleaned by the CFD sets our approach discovers on dirty data against that achieved by CFD sets obtained from the clean data

  • In Refs. [11, 14], for centrally stored relational databases, approaches are designed to automatically detect tuples that violate CFDs and Conditional Inclusion Dependencies (CINDs) via Structured Query Language (SQL) query processing, as sketched after this list
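
The following sketch illustrates that SQL-based detection idea on a toy table. The schema, data, and query are our own assumptions for illustration and are not the queries given in Refs. [11, 14]: tuples that agree on the left-hand side of a variable CFD but disagree on its right-hand side can be found with a single GROUP BY query.

```python
# Toy illustration of detecting CFD violations with SQL; all names and data
# are assumptions made for this sketch.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (country_code TEXT, area_code TEXT, city TEXT);
    INSERT INTO customer VALUES ('44','131','Edinburgh'),
                                ('44','131','Glasgow'),
                                ('01','908','New York');
""")

# For the variable CFD [country_code, area_code] -> city, tuples agreeing on
# the LHS must agree on the RHS; any group with more than one distinct city
# therefore contains violating tuples.
rows = conn.execute("""
    SELECT country_code, area_code
    FROM customer
    GROUP BY country_code, area_code
    HAVING COUNT(DISTINCT city) > 1
""").fetchall()
print(rows)  # [('44', '131')]: this group violates the dependency
```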


Summary

Introduction

With the continuing accumulation of data, databases have become increasingly large. At the same time, owing to the difficulty of manual maintenance and the variety of data sources, big data is highly likely to contain quality problems that make it difficult to use. Existing work [2] efficiently discovers high-quality rules with data-mining algorithms on small but clean datasets. A scalable method is therefore needed to mine high-quality rules from big data whose size exceeds main memory. To achieve this goal, we design a scalable and systematic algorithm. We sample the data when the dataset is larger than the memory; another purpose of sampling is to filter out dirty items and keep clean ones. A rule-discovery method suitable for big data larger than the memory requires features that existing methods do not have. We propose a method for discovering a high-quality CFD set that tolerates data-quality problems and meets user requirements on datasets larger than the memory; a sketch of the sampling idea follows.
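
The summary does not spell out the paper's one-pass sampling algorithm, so the sketch below uses standard reservoir sampling to illustrate the general idea: maintaining a fixed-size, uniformly random training sample of a dataset that is too large to hold in memory, in a single pass.

```python
# A standard reservoir-sampling (Algorithm R) sketch of one-pass sampling;
# this illustrates the general technique, not the paper's exact algorithm.
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items sampled uniformly from an iterable of unknown size."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)         # keep item with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Usage: sample 1000 tuples from a file read line by line, never loading
# the whole dataset into memory.
# with open("big_table.csv") as f:
#     training_set = reservoir_sample(f, 1000)
```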

Background
Problem definition
Framework
Multiple-pass scan algorithm
Tuple selection criteria
One-pass sampling algorithm
DFCFD algorithm
Dealing with conflicts between CFDs
Calculating the weight of each node
Discovery of the conflict between two CFDs
Parameter Selection
Experimental settings
Performance and scalability experiments
Optimality of parameters
Test on real data
Related Work
Findings
Conclusion
