S-FPG: A parallel version of FP-Growth algorithm under Apache Spark™

Aissatou Diaby Dite Gassama, Fode Camara, Samba Ndiaye

doi:10.1109/icccbda.2017.7951891

Abstract

Frequent Itemset Mining (FIM) is an essential data mining task with many real-world applications, such as market basket analysis and outlier detection. Many efficient single-node FIM algorithms, such as the well-known FP-Growth algorithm, have been proposed over the last two decades. However, as large-scale datasets have become common, these algorithms are no longer efficient enough to mine frequent itemsets over big data. Scalable parallel algorithms hold the key to solving the problem in this context. However, existing parallel versions of the FP-Growth algorithm, implemented with the disk-based MapReduce model, are not efficient enough for iterative computation. In this paper, we propose an implementation of scalable parallel FP-Growth using the in-memory parallel computing framework Apache Spark™. Our experimental results demonstrate that the proposed algorithm scales well and processes large datasets efficiently.
