Abstract

A ubiquitous challenge throughout all areas of data mining, particularly in the mining of frequent patterns in large databases, is centered on the necessity to reduce the time and space required to perform the search. The extent of this reduction proportionally facilitates the ability to identify patterns of interest. High utility sequential pattern mining (HUSPM) seeks to identify frequent patterns that are (1) sequential in nature and (2) hold a significant magnitude of utility in a sequence database, by considering the aspect of item value or importance. While traditional sequential pattern mining relies on the downward closure property to significantly reduce the required search space, with HUSPM, this property does not hold. To address this drawback, an approach is proposed that establishes a tight upper bound on the utility of future candidate sequential patterns by maintaining a list of items that are deemed potential candidates for concatenation. Such candidates are provably the only items that are ever needed for any extension of a given sequential pattern or its descendants in the search tree. This list is then exploited to significantly further tighten the upper bound on the utilities of descendant patterns. An extension of this work is then proposed that significantly reduces the computational cost of updating database utilities each time a candidate item is removed from the list, resulting in a massive reduction in the number of candidate sequential patterns that need to be generated in the search. Sequential pattern mining methods implementing these new techniques for bound reduction and further candidate list reduction are demonstrated via the introduction of the CRUSP and CRUSPPivot algorithms, respectively. Validation of the techniques was conducted on six public datasets.
Tests show that use of the CRUSP algorithm results in a significant reduction in the overall number of candidate sequential patterns that need to be considered, and subsequently a significant reduction in run time, when compared to the current state of the art in bounding techniques. When employing the CRUSPPivot algorithm, the further reduction in the size of the search space was found to be dramatic, with the reduction in run time found to be dramatic to moderate, depending on the dataset. Demonstrating the practical significance of the work, experiments showed that time required for one particularly complex dataset was reduced from many hours to less than one minute.
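
The claim that downward closure does not hold for utility can be illustrated with a toy example. The sketch below is not taken from the paper: the database, the `min_util` threshold, and the helper names `pattern_utility` and `database_utility` are hypothetical, and each sequence is simplified to a list of single-item (item, utility) pairs. It assumes the common convention that a pattern's utility in a sequence is the maximum over its occurrences.

```python
from itertools import combinations

def pattern_utility(pattern, sequence):
    """Highest utility over all occurrences of `pattern` as a subsequence."""
    best = 0
    for idxs in combinations(range(len(sequence)), len(pattern)):
        # An occurrence matches when the items at the chosen positions equal the pattern.
        if all(sequence[i][0] == p for i, p in zip(idxs, pattern)):
            best = max(best, sum(sequence[i][1] for i in idxs))
    return best

def database_utility(pattern, database):
    """Sum of the pattern's utility over every sequence in the database."""
    return sum(pattern_utility(pattern, seq) for seq in database)

# Hypothetical toy database: each sequence is a list of (item, utility) pairs.
database = [
    [("a", 1), ("b", 10)],
    [("a", 1), ("c", 2)],
]
min_util = 10  # assumed threshold, for illustration only

u_a = database_utility(["a"], database)        # 1 + 1 = 2
u_ab = database_utility(["a", "b"], database)  # 11 + 0 = 11
print(u_a >= min_util, u_ab >= min_util)       # False True
```

Here the pattern ⟨a⟩ falls below the threshold while its extension ⟨a, b⟩ exceeds it, so discarding ⟨a⟩ on its own utility would miss a high utility pattern. This is why HUSPM methods prune on upper bounds rather than on the utility itself.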

Highlights

  • High utility sequential pattern mining (HUSPM) [1,2] is a subfield of frequent pattern mining [3] that assigns levels of relative magnitude or importance to objects with the goal of identifying more impactful patterns

  • Two approaches are tested, namely the maximum concatenation utility (MCU) method, which maintains a list of items that are candidates for future concatenation, and the reduced concatenation utility (RCU) method, which further reduces the upper bound on descendant pattern utilities by capitalizing on items having been removed from the candidate lists

  • The performance of these two approaches is compared to that of two state-of-the-art approaches from the literature, namely the sequence-weighted utility (SWU) method for determining upper bounds on candidate utilities as implemented by uSpan [2] and the reduced sequence utility (RSU) method implemented by HUS-Span [7]
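
As a point of reference for the baseline bound mentioned above, the following is a minimal sketch of sequence-weighted utility. It assumes the usual definition of SWU as the sum of the total utilities of all sequences that contain the pattern. The toy database, the threshold, and the helper names `contains` and `swu` are hypothetical, and itemsets are simplified to single items.

```python
def contains(pattern, sequence):
    """True if `pattern` occurs as a subsequence of the item labels in `sequence`."""
    items = iter(item for item, _ in sequence)
    # `p in items` advances the iterator, so matches must appear in order.
    return all(p in items for p in pattern)

def swu(pattern, database):
    """Sum of whole-sequence utilities over the sequences that contain `pattern`."""
    return sum(
        sum(u for _, u in seq)
        for seq in database
        if contains(pattern, seq)
    )

# Hypothetical toy database: each sequence is a list of (item, utility) pairs.
database = [
    [("a", 1), ("b", 10), ("c", 3)],  # total utility 14
    [("a", 1), ("c", 2)],             # total utility 3
    [("b", 4), ("c", 1)],             # total utility 5
]
min_util = 18  # assumed threshold, for illustration only

print(swu(["a"], database))       # 14 + 3 = 17
print(swu(["a", "b"], database))  # 14
```

Because every supersequence of a pattern is contained in a subset of the same sequences, its SWU can only stay equal or shrink. With the assumed threshold of 18, ⟨a⟩ and all of its extensions could therefore be pruned. The bound is loose, since it counts the utility of every item in each matching sequence, which is the slack that tighter bounds aim to remove.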

Summary

Introduction

High utility sequential pattern mining (HUSPM) [1,2] is a subfield of frequent pattern mining [3]. Existing bound-based search pruning methods, namely the PEU and RSU approaches discussed above, are extended by a search technique that maintains a list of candidate concatenation items. The use of this list has a significant impact on the search process since, for any sequential pattern sp under consideration in the search, only the items in the candidate list associated with sp need ever be considered for concatenation for any supersequence of sp with sp as prefix. A relaxed upper bound on the utility of all pattern extensions, referred to as the pivot-centered prefix extension utility (PPEU), is proposed. While this value will always be greater than or equal to the PEU for a particular sequential pattern, seemingly rendering it less effective at pruning, it has the significant benefit that remaining utility values do not need to be maintained at all positions in the database.
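
The idea of passing a shrinking candidate list down the search tree can be sketched as follows. This is an illustration only, not the CRUSP or CRUSPPivot algorithm: it uses sequence-weighted utility as a stand-in bound rather than PEU or PPEU, considers only single-item sequence extensions, and all names (`contains`, `bound`, `search`), the toy database, and the threshold are hypothetical.

```python
def contains(pattern, sequence):
    """True if `pattern` occurs as a subsequence of the item labels in `sequence`."""
    items = iter(item for item, _ in sequence)
    return all(p in items for p in pattern)

def bound(pattern, database):
    """Stand-in upper bound (sequence-weighted utility), NOT the paper's PEU or PPEU."""
    return sum(sum(u for _, u in seq) for seq in database if contains(pattern, seq))

def search(pattern, candidates, database, min_util, visited):
    """Depth-first extension in which children inherit the parent's surviving items."""
    visited.append(tuple(pattern))
    # Drop every item whose extension cannot reach the threshold.
    surviving = [i for i in candidates if bound(pattern + [i], database) >= min_util]
    for item in surviving:
        # Descendants are only ever offered the items that survived here.
        search(pattern + [item], surviving, database, min_util, visited)

# Hypothetical toy database: each sequence is a list of (item, utility) pairs.
database = [
    [("a", 2), ("b", 5), ("c", 1)],
    [("a", 1), ("b", 6)],
    [("c", 3), ("a", 1)],
]
visited = []
search([], ["a", "b", "c"], database, min_util=10, visited=visited)
print(visited)  # [(), ('a',), ('a', 'b'), ('b',), ('c',)]
```

In this run, the node for ⟨a⟩ keeps only b in its list, so the node for ⟨a, b⟩ evaluates a single candidate instead of all three items. Removing an item permanently is safe here because the stand-in bound never increases as a pattern is extended.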

Literature Review
Sequential Pattern Mining
High Utility Sequential Pattern Mining
Lexicographic Tree Search
Existing Pruning Strategies for HUSPM
PEU-Based Candidate Maintenance
The CRUSP Algorithm
Pivot-Centered PEU-Based Candidate Maintenance
The CRUSPPivot Algorithm
Objectives and Hypotheses
Conclusions