Abstract
Current Conditional Functional Dependency (CFD) discovery algorithms require a well-prepared training dataset, which makes them difficult to apply to large, low-quality datasets. To handle the volume issue of big data, we develop sampling algorithms to obtain a small representative training set. To address the low-quality issue of big data, we design fault-tolerant rule discovery and conflict-resolution algorithms. We also propose a parameter selection strategy to ensure the effectiveness of CFD discovery. Experimental results demonstrate that our method can discover effective CFD rules on billion-tuple data within a reasonable period.
Highlights
With the ongoing accumulation of data, databases have become increasingly large.
(3) The time required to clean data with the discovered Conditional Functional Dependencies (CFDs). (4) The quality of the cleaned data, measured as the percentage of data cleaned according to the CFD sets discovered by our approach on dirty data versus the CFD sets obtained from clean data.
In Refs. [11, 14], approaches are designed for centrally stored relational databases to automatically detect tuples that violate CFDs and Conditional Inclusion Dependencies (CINDs) based on Structured Query Language (SQL) query processing.
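To make the SQL-based detection idea concrete, the sketch below checks one example CFD against a small table. The table `cust` and the CFD ([country, zip] → city, with pattern country = 'UK') are hypothetical illustrations, not the queries from Refs. [11, 14]; the point is only that pattern-restricted violations can be found with a single self-join query.

```python
# Minimal sketch of SQL-based CFD violation detection (illustrative only).
# Example CFD: ([country, zip] -> city) holding where country = 'UK'.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cust (country TEXT, zip TEXT, city TEXT);
INSERT INTO cust VALUES
  ('UK', 'EH4',   'Edinburgh'),
  ('UK', 'EH4',   'London'),   -- violation: same (country, zip), different city
  ('US', '10001', 'NYC'),
  ('US', '10001', 'Boston');   -- no violation: pattern restricts CFD to country = 'UK'
""")

# A tuple violates the CFD if it matches the pattern on the left-hand side
# and shares (country, zip) with a tuple holding a different city value.
violations = conn.execute("""
SELECT t.country, t.zip, t.city
FROM cust t
WHERE t.country = 'UK'
  AND EXISTS (SELECT 1 FROM cust u
              WHERE u.country = t.country
                AND u.zip = t.zip
                AND u.city <> t.city)
""").fetchall()
print(violations)
```

Only the two UK tuples are reported; the conflicting US tuples fall outside the CFD's pattern, which is exactly what distinguishes a CFD from an ordinary functional dependency.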
Summary
With the accumulation of data at present, databases have become increasingly large. At the same time, due to the difficulty of manual maintenance and the variety of data sources, big data carries a high risk of quality problems that make it difficult to use. Existing work such as Ref. [2] efficiently discovers high-quality rules with data mining algorithms, but only on a small and clean dataset. A scalable method is therefore needed to mine high-quality rules from big data whose size exceeds main memory. To achieve this goal, we design a scalable and systematic algorithm. Sampling serves two purposes: it reduces the data volume when the dataset is larger than memory, and it filters out dirty items while keeping clean ones. A rule discovery method suitable for big data larger than memory requires features that existing methods do not have. We propose a method for discovering a high-quality CFD set; such an approach tolerates data-quality problems and meets user requirements for datasets larger than memory.
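The summary does not specify the paper's sampling algorithms, but the general idea of drawing a small representative training set from a stream too large for memory can be sketched with standard reservoir sampling (Algorithm R). The `is_clean` predicate below is a hypothetical placeholder for the dirty-item filtering step.

```python
# Sketch: uniform reservoir sampling over a stream larger than memory,
# with a hypothetical cleanliness filter applied before sampling.
import random

def reservoir_sample(stream, k, is_clean=lambda t: True, seed=0):
    """Return a uniform sample of up to k clean tuples from an arbitrary stream."""
    rng = random.Random(seed)
    sample = []
    seen = 0
    for tup in stream:
        if not is_clean(tup):        # drop obviously dirty items first
            continue
        seen += 1
        if len(sample) < k:
            sample.append(tup)       # fill the reservoir
        else:
            j = rng.randrange(seen)  # replace an entry with probability k/seen
            if j < k:
                sample[j] = tup
    return sample

# Usage: sample 5 tuples from a generator standing in for a billion-tuple scan.
data = ((i, f"val{i}") for i in range(100_000))
training_set = reservoir_sample(data, 5)
print(training_set)
```

Because each incoming tuple replaces a reservoir entry with probability k/seen, every clean tuple ends up in the sample with equal probability regardless of stream length, so the training set stays small and representative even for datasets that never fit in memory.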