Mining high utility itemsets with time‐aware scheduling using Apache Spark

Anup Brahmavar, Geetha Maiya, Harish Venkatarama

doi:10.1002/cpe.7192

Abstract

Over the last decade, Market Basket Analysis has been augmented with revenue information. Termed high utility itemset mining (HUIM), this task takes into account the purchase quantity and unit profit of the items in the transaction database. Although several sequential algorithms for mining high utility itemsets (HUIs) exist, their performance degrades as the database grows. Distributed computing platforms such as Apache Hadoop and Apache Spark have proven effective in alleviating this bottleneck. The current study therefore develops a parallel workflow that adapts a single‐phase, tree‐based algorithm, the single phase utility computation (SPUC) algorithm, to a Spark cluster. The resulting parallel SPUC (PSPUC) algorithm uses an assignment strategy that partitions the search space across the cluster based on the time taken to mine individual conditional pattern bases in SPUC. Experimental evaluation on real and synthetic datasets demonstrates that PSPUC outperforms the PHUI‐Growth algorithm. In addition, PSPUC with the time‐aware assignment strategy completes mining faster than PSPUC with a random assignment of items. A linear speedup of PSPUC is also demonstrated.
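To make two ideas from the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. The first function computes the utility of an itemset as the sum of purchase quantity times unit profit over the transactions that contain every item in the set. The second distributes items across workers by estimated mining time using a greedy longest-processing-time-first heuristic; this heuristic is only an assumed stand-in, since the abstract does not specify how PSPUC's time-aware assignment works. All names, data, and time estimates here are hypothetical.

```python
import heapq

# Hypothetical toy data: unit profit per item, and transactions as item -> quantity.
UNIT_PROFIT = {"a": 5, "b": 2, "c": 1}
TRANSACTIONS = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 3},
    {"b": 1, "c": 1},
]


def itemset_utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit over transactions containing the whole itemset."""
    total = 0
    for transaction in transactions:
        if all(item in transaction for item in itemset):
            total += sum(transaction[item] * unit_profit[item] for item in itemset)
    return total


def assign_by_estimated_time(estimated_time, n_workers):
    """Greedy longest-processing-time-first assignment (an assumed heuristic).

    estimated_time maps each item to a hypothetical estimate of how long its
    conditional pattern base takes to mine. Items are handed out from the most
    to the least expensive, each going to the currently least-loaded worker.
    """
    # Min-heap of (accumulated load, worker id).
    heap = [(0.0, worker) for worker in range(n_workers)]
    heapq.heapify(heap)
    assignment = {worker: [] for worker in range(n_workers)}
    # Sort by descending time; break ties by item name for determinism.
    ordered = sorted(estimated_time.items(), key=lambda kv: (-kv[1], kv[0]))
    for item, cost in ordered:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(item)
        heapq.heappush(heap, (load + cost, worker))
    return assignment
```

With the toy data, `itemset_utility({"a", "b"}, TRANSACTIONS, UNIT_PROFIT)` counts only the first transaction, since it is the only one containing both items. An itemset is a high utility itemset when this value meets a user-chosen minimum utility threshold.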

Full Text