On a parallel Spark workflow for frequent itemset mining based on array prefix‐tree

Xinzheng Niu, Peng Wu, Mideng Qian, Aiqin Hou, Chase Q Wu

doi:10.1002/cpe.6313

https://doi.org/10.1002/cpe.6313

Abstract

Extracting frequent itemsets from datasets is an important problem in data mining, for which several methods, including FP‐Growth, have been proposed. FP‐Growth is a classical frequent itemset mining method that builds pattern databases without generating candidate itemsets. Because of its high time complexity and memory usage, many improvements have been proposed in the literature; however, most of them still suffer from performance issues on large datasets. In this paper, we design an auxiliary structure, Array Prefix‐Tree (AP‐Tree), and propose a new algorithm, Array Prefix‐Tree Growth (APT‐Growth), which is further parallelized as a Spark workflow, referred to as PAPT‐Growth. We incorporate an adaptive algorithm selection process, based on a density threshold, into PAPT‐Growth to ensure its running‐time performance. We conduct extensive experiments with different thresholds on multiple datasets, and the results show that PAPT‐Growth outperforms several state‐of‐the‐art methods such as PFP, YAFIM, and DFPS. The analysis of density reveals a change point, which justifies the necessity and validity of adaptive algorithm selection.
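To make the idea of density-based adaptive selection concrete, the following is a minimal Python sketch and not the authors' implementation. The abstract does not define density, so the measure used here (average transaction length divided by the number of distinct items) is an assumption. The function names, the default threshold of 0.5, and the strategy labels are hypothetical as well. The sketch also shows the first counting pass that FP‐Growth‐style miners typically perform to find frequent single items.

```python
from collections import Counter
from typing import Dict, List, Set


def dataset_density(transactions: List[Set[str]]) -> float:
    """Average transaction length divided by the number of distinct items.

    Assumption: the abstract does not give the paper's density definition;
    this is one commonly used measure, chosen here for illustration only.
    """
    if not transactions:
        return 0.0
    distinct_items = set().union(*transactions)
    if not distinct_items:
        return 0.0
    avg_len = sum(len(t) for t in transactions) / len(transactions)
    return avg_len / len(distinct_items)


def frequent_items(transactions: List[Set[str]], min_support: float) -> Dict[str, int]:
    """Count single items and keep those meeting the relative support threshold."""
    counts = Counter(item for t in transactions for item in t)
    min_count = min_support * len(transactions)
    return {item: c for item, c in counts.items() if c >= min_count}


def select_strategy(density: float, threshold: float = 0.5) -> str:
    """Pick a mining strategy from the dataset density.

    Hypothetical: the labels and the default threshold are placeholders; the
    abstract does not say which algorithm is used on which side of the threshold.
    """
    return "dense_strategy" if density >= threshold else "sparse_strategy"


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
    d = dataset_density(data)
    print(select_strategy(d), frequent_items(data, min_support=0.75))
```

In a Spark workflow, the density could be computed once on the driver from aggregate counts before the distributed mining stage is launched, so the selection adds little overhead; this placement is likewise an assumption rather than something the abstract states.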
