Abstract
High-utility sequential pattern mining (HUSPM) is the task of discovering all sequential patterns in a sequence database whose utility is equal to or greater than a given minimum utility threshold. HUSPM has become increasingly important in many real-world data mining applications, such as market basket analysis, weblog mining, and biomedical gene data analysis, which consider co-occurrence values, quantity, utility (e.g., profit or cost), and time. Current HUSPM approaches in the literature use a utility matrix to store the sequence database in memory. Unfortunately, the utility matrix consumes a large amount of main memory. To address this issue, we introduce a pure array structure that reduces memory consumption compared to the utility matrix. In addition, HUSPM faces the challenge of establishing a downward closure property (DCP) with which to prune the search space. Recent HUSPM algorithms have used an upper bound on utility values to serve as the DCP. However, this upper bound is usually higher than the actual utility of the patterns, so these algorithms may generate many candidate patterns. The resulting large search space leads to poor performance in both runtime and memory usage, in part because the number of projected database scans required to calculate actual utilities is proportional to the number of candidate patterns. In this paper, we present a novel pruning strategy that efficiently prunes non-high-utility sequential patterns (non-HUSPs) and significantly reduces the search space compared to the state-of-the-art HUS-Span algorithm. Moreover, we propose a parallel strategy to speed up the mining process. Based on these contributions, we propose two algorithms: the pure Array structure for High-utility Sequential pattern mining (AHUS) and AHUS parallel mining (AHUS-P). The AHUS-P algorithm uses shared memory to parallelize the mining process, identifying HUSPs concurrently by taking advantage of multi-core processor architectures.
Experimental results show that AHUS and AHUS-P efficiently and effectively discover all HUSPs. Both proposed algorithms outperform the state-of-the-art HUS-Span algorithm in terms of runtime, memory usage, and scalability on all experimental datasets.
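
To make the task definition concrete, the following is a minimal brute-force sketch of how the utility of one sequential pattern can be evaluated against a minimum utility threshold. It is not the AHUS algorithm and uses neither the array structure nor the pruning strategy proposed in the paper. The toy database, the profit table, and the function names are hypothetical, and the sketch assumes the convention common in the HUSPM literature that the utility of a pattern in a sequence is the maximum over all of its occurrences, summed over the sequences that contain it.

```python
from typing import Dict, FrozenSet, List, Optional

QItemset = Dict[str, int]          # item -> purchased quantity (hypothetical data model)
QSequence = List[QItemset]         # ordered list of itemsets
Pattern = List[FrozenSet[str]]     # ordered list of item sets, without quantities


def utility_in_sequence(pattern: Pattern, seq: QSequence,
                        profit: Dict[str, int]) -> Optional[int]:
    """Return the maximum utility over all occurrences of `pattern` in `seq`,
    or None if the pattern does not occur (assumed max-over-occurrences rule)."""
    # best[k] is the best utility found so far for the first k pattern itemsets.
    best: List[Optional[int]] = [0] + [None] * len(pattern)
    for itemset in seq:
        # Go from the last pattern position to the first, so that one sequence
        # itemset is matched to at most one pattern itemset.
        for k in range(len(pattern), 0, -1):
            prev = best[k - 1]
            wanted = pattern[k - 1]
            if prev is not None and all(item in itemset for item in wanted):
                gain = sum(itemset[item] * profit[item] for item in wanted)
                if best[k] is None or prev + gain > best[k]:
                    best[k] = prev + gain
    return best[len(pattern)]


def pattern_utility(pattern: Pattern, db: List[QSequence],
                    profit: Dict[str, int]) -> int:
    """Sum the per-sequence utilities over all sequences containing the pattern."""
    total = 0
    for seq in db:
        u = utility_in_sequence(pattern, seq, profit)
        if u is not None:
            total += u
    return total


# Hypothetical toy data: unit profits and three quantitative sequences.
PROFIT = {"a": 2, "b": 5, "c": 1}
DB: List[QSequence] = [
    [{"a": 1, "b": 2}, {"c": 3}, {"a": 4}],
    [{"a": 2}, {"b": 1, "c": 2}],
    [{"b": 3}, {"a": 1}],
]

MIN_UTIL = 30  # hypothetical minimum utility threshold
b_then_a = [frozenset({"b"}), frozenset({"a"})]
a_then_c = [frozenset({"a"}), frozenset({"c"})]
print(pattern_utility(b_then_a, DB, PROFIT) >= MIN_UTIL)  # pattern <b>,<a>
print(pattern_utility(a_then_c, DB, PROFIT) >= MIN_UTIL)  # pattern <a>,<c>
```

Evaluating every candidate pattern this way requires a scan of the database per candidate, which illustrates why reducing the number of candidates matters for runtime.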