Efficient high-utility occupancy itemset mining algorithm on massive data

Jingxuan He, Xixian Han, Jinbao Wang, Kaiqi Zhang

doi:10.1016/j.eswa.2022.118329

Abstract

Mining interesting itemsets from massive data is an important topic in data mining. Most existing studies use either frequency or utility as the primary measure, but each measure has limitations when used alone. For example, itemsets with high frequency may yield low profit, while itemsets with high utility may appear only occasionally, so results based on a single measure can be misleading. In addition, existing algorithms can only handle small to medium-scale databases, and their performance degrades significantly as the data grows. To address these drawbacks, this paper proposes a novel high-utility occupancy itemset mining algorithm, SHO (Suffix-based High-utility Occupancy itemset mining), which considers both the quantities and the profits of itemsets. SHO is built on suffix-based partitioning, generation pruning, and itemset linking, which allow it to mine high-utility occupancy itemsets on large-scale databases effectively. First, the database is divided into non-overlapping suffix-based partitions stored in vertical format, so that the support and utility occupancy of an itemset can be calculated within a single partition instead of by traversing the entire database. In addition, two optimization strategies and four pruning strategies are proposed to speed up SHO. Extensive experiments show that SHO substantially outperforms the current state-of-the-art algorithm, improving efficiency by up to three orders of magnitude.
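
To make the two measures concrete, the following is a minimal sketch of computing support and utility occupancy from a vertical layout. It is not the SHO algorithm itself and omits suffix-based partitioning and all pruning. It assumes the common definition in which the utility occupancy of an itemset in a transaction is the itemset's utility divided by the transaction's total utility, averaged over the transactions that contain the itemset. The abstract does not state this definition, and the function names `to_vertical` and `support_and_occupancy` and the input layout are hypothetical.

```python
from collections import defaultdict


def to_vertical(transactions):
    """Convert a horizontal database into a vertical one.

    `transactions` maps tid -> {item: utility}, where utility is assumed to be
    quantity * unit profit and strictly positive (an assumption of this sketch).
    Returns (vertical, totals): vertical maps item -> {tid: utility},
    totals maps tid -> total utility of that transaction.
    """
    vertical = defaultdict(dict)
    totals = {}
    for tid, items in transactions.items():
        totals[tid] = sum(items.values())
        for item, util in items.items():
            vertical[item][tid] = util
    return dict(vertical), totals


def support_and_occupancy(itemset, vertical, totals):
    """Return (support, utility occupancy) of `itemset`.

    Support is the number of transactions containing every item, found by
    intersecting the per-item tid sets. Utility occupancy is the mean share
    of each supporting transaction's utility that the itemset accounts for.
    """
    tids = None
    for item in itemset:
        item_tids = set(vertical.get(item, {}))
        tids = item_tids if tids is None else tids & item_tids
    if not tids:
        return 0, 0.0
    share_sum = sum(
        sum(vertical[item][tid] for item in itemset) / totals[tid]
        for tid in tids
    )
    return len(tids), share_sum / len(tids)


if __name__ == "__main__":
    # Toy data invented for illustration only.
    db = {
        1: {"a": 4, "b": 6},
        2: {"a": 2, "c": 8},
        3: {"a": 5, "b": 5},
    }
    vertical, totals = to_vertical(db)
    print(support_and_occupancy({"a", "b"}, vertical, totals))
    print(support_and_occupancy({"a"}, vertical, totals))
```

In this toy data, item `a` is the most frequent item but accounts for only a small share of each transaction's utility, while `{a, b}` is less frequent but covers the whole utility of the transactions that contain it, which is the kind of contrast a combined measure is meant to expose.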

Full Text