Abstract

Association rule discovery from large databases is one of the most challenging tasks in data mining. Frequent itemset mining, the first step in mining association rules, is computationally and I/O intensive because it requires repeated passes over the entire database. Sampling has often been suggested as an effective way to reduce the size of the dataset processed, at some cost to accuracy. The data mining literature offers numerous sampling-based approaches to speed up Association Rule Mining (ARM). In our earlier research [29], we presented an efficient progressive sampling-based approach for mining association rules from massive databases. In this article, we validate that approach under different empirical variations and analyze the validation results using synthetic datasets. The approach starts by selecting an initial sample based on the temporal characteristics and size of the database. The frequent itemsets and the negative border are then mined from the initial sample using the Apriori algorithm. The itemsets in the negative border are sorted by support, and the midpoint itemset of the sorted negative border is scanned against portions of the database of different sizes to check its frequency. If the support of the midpoint itemset is greater than the support threshold, the sample size is progressively increased. This process is repeated until an optimal sample size is reached, and association rules are then mined from that optimal sample. The empirical validation also identifies the appropriate database size for conducting the midpoint itemset scan.
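The loop described above can be sketched roughly as follows. This is a minimal illustration and not the implementation from [29]: the function names, the uniform random initial sample (the paper selects it from temporal characteristics and database size), the fixed growth factor, and the choice to scan the midpoint itemset against the full database rather than against portions of different sizes are all assumptions made for the example.

```python
import random
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def apriori_with_negative_border(transactions, min_sup):
    """Level-wise Apriori. Returns (frequent, border), each frozenset -> support.

    The negative border holds infrequent candidates whose proper subsets
    are all frequent.
    """
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent, border = {}, {}
    k = 1
    while candidates:
        level = []
        for c in candidates:
            s = support(c, transactions)
            if s >= min_sup:
                frequent[c] = s
                level.append(c)
            else:
                border[c] = s
        k += 1
        # Join frequent (k-1)-itemsets, then prune by the Apriori property.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        ]
    return frequent, border

def progressive_sample(database, min_sup, initial_size, growth=2, seed=0):
    """Grow the sample while the midpoint border itemset is frequent in the database.

    Assumptions (hypothetical, for illustration): uniform random sampling,
    geometric growth, and a midpoint scan over the full database.
    """
    rng = random.Random(seed)
    size = min(initial_size, len(database))
    while True:
        sample = rng.sample(database, size)
        frequent, border = apriori_with_negative_border(sample, min_sup)
        if not border or size == len(database):
            return sample, frequent
        # Sort the negative border by sample support; ties broken by item names.
        ranked = sorted(border, key=lambda s: (border[s], sorted(s)))
        midpoint = ranked[len(ranked) // 2]
        if support(midpoint, database) > min_sup:
            # The sample missed an itemset that is frequent overall: enlarge it.
            size = min(len(database), size * growth)
        else:
            return sample, frequent
```

Scanning a single midpoint itemset is far cheaper than re-mining the database, which is why it can serve as a stopping test; how large a portion of the database that scan should cover is the question the empirical validation addresses.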
