Abstract

Finding the most interesting correlations among items is essential for problems in many commercial, medical, and scientific domains. Our previous work on the maximal fully-correlated itemset (MFCI) framework can rule out itemsets with irrelevant items, and its downward-closed property helps achieve good computational performance. However, calculating the desired MFCIs in large databases still raises two computational issues. First, unlike finding maximal frequent itemsets, where pruning can start from 1-itemsets, finding MFCIs must start pruning from 2-itemsets. When the number of items in a dataset is large and the supports of all pairs cannot be loaded into memory, the I/O cost (\(O(n^2)\)) of calculating the correlation of all pairs can be very high. Second, users usually need to try different correlation thresholds to obtain different desirable MFCIs, so rerunning the Apriori-style procedure for each new correlation threshold is also very expensive. To address these problems, we propose two techniques. First, we identify a correlation upper bound for any good correlation measure to avoid unnecessary I/O queries for the supports of pairs, and we make use of their common monotone property to prune many pairs without even computing their correlation upper bounds. Second, we build an enumeration tree that saves the fully-correlated value of every MFCI under a given initial correlation threshold. We can then either efficiently retrieve the desired MFCIs for any threshold above the initial threshold or incrementally grow the tree if the given threshold falls below it. Experimental results show that our algorithm can be an order of magnitude faster than the original MFCI algorithm.
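To make the first technique concrete, the sketch below is a minimal illustration, not the paper's implementation: it assumes the φ-coefficient as the correlation measure and uses hypothetical helper names. It bounds a pair's correlation from single-item supports alone (via \(P(AB) \le \min(P(A), P(B))\)), so a pair-support I/O query is issued only when the bound reaches the threshold, and it exploits the bound's monotonicity in the larger support to stop scanning early.

```python
import math

def phi_upper_bound(p_a, p_b):
    """Upper bound on the phi-coefficient of a pair using only single-item
    supports, obtained by substituting P(AB) <= min(P(A), P(B)) into phi's
    definition. Assumes 0 < p_a, p_b < 1."""
    lo, hi = min(p_a, p_b), max(p_a, p_b)
    return math.sqrt((lo * (1.0 - hi)) / (hi * (1.0 - lo)))

def candidate_pairs(supports, theta):
    """Return the 2-itemsets whose correlation upper bound reaches theta;
    only these pairs need a pair-support I/O query.

    `supports` maps item -> relative support, `theta` is the correlation
    threshold. Items are scanned in ascending support order; because the
    bound only shrinks as the larger support grows, the inner loop can
    stop at the first item whose bound drops below theta, pruning the
    remaining pairs without computing their bounds.
    """
    items = sorted(supports, key=supports.get)        # ascending support
    candidates = []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if phi_upper_bound(supports[a], supports[b]) < theta:
                break                                 # later pairs pruned too
            candidates.append((a, b))
    return candidates
```

For example, `candidate_pairs({"milk": 0.4, "bread": 0.5, "caviar": 0.01}, 0.3)` would keep the milk–bread pair but prune both pairs involving the rare item without ever querying their pair supports.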
