Abstract
Mining Closed Sequential Patterns (CSPs) is an essential data mining task with many applications. CSPs are used to cope with huge databases or low minimum support (minsup) thresholds in sequential pattern mining. However, tuning the minsup value so that the number of generated CSPs matches what users want is difficult and time-consuming. To address this issue, the TSP algorithm for mining top-k CSPs was previously proposed, where k is a user-given parameter. The algorithm returns the k CSPs that have the highest support values in a database. However, its execution time and memory usage are high. In this paper, an algorithm named TKCS (Top-K Closed Sequences) is proposed to mine the top-k CSPs efficiently. To reduce execution time and memory usage, it represents the data as a vertical bitmap database. In addition, it adopts several strategies while mining the top-k CSPs, such as always choosing the sequential patterns with the greatest support values for generating candidate patterns, and storing the top-k CSPs in ascending order of support so that the minsup value can be raised more quickly. The empirical results show that TKCS outperforms TSP in discovering the top-k CSPs in terms of both runtime and memory usage.
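To make the idea of a vertical bitmap database concrete, the following is a minimal sketch in the style of the SPAM family of algorithms, not the layout actually used by TKCS, which the abstract does not specify. It assumes every sequence is a list of single items (no multi-item itemsets), and the function names `build_vertical_bitmaps`, `support`, and `s_extend` are hypothetical.

```python
from collections import defaultdict


def build_vertical_bitmaps(database):
    """Map each item to {sequence_id: bitmap}.

    Bit p of a bitmap is set when the item occurs at position p of that
    sequence. Assumption: each sequence is a list of single items.
    """
    bitmaps = defaultdict(dict)
    for sid, sequence in enumerate(database):
        for pos, item in enumerate(sequence):
            bitmaps[item][sid] = bitmaps[item].get(sid, 0) | (1 << pos)
    return dict(bitmaps)


def support(bitmaps_by_sid):
    """Support = number of sequences with a non-zero bitmap.

    Only non-zero bitmaps are ever stored, so this is the dict size.
    """
    return len(bitmaps_by_sid)


def s_extend(pattern_bitmaps, item_bitmaps):
    """Bitmaps of the pattern followed, at any later position, by the item."""
    extended = {}
    for sid, bits in pattern_bitmaps.items():
        item_bits = item_bitmaps.get(sid, 0)
        if not item_bits:
            continue
        first_end = bits & -bits                 # earliest position where the pattern ends
        strictly_after = ~((first_end << 1) - 1)  # mask of all later positions
        joined = item_bits & strictly_after
        if joined:
            extended[sid] = joined
    return extended
```

With this layout, the support of an extended pattern is obtained from bitwise operations on the stored bitmaps instead of rescanning the sequences, which is the usual motivation for a vertical representation.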
Highlights
In the domain of data mining from a sequence database, mining sequential patterns is an essential task that has been extensively examined [1], [3], [4], [8]–[11], [14], [17], [23], [27], [35].
Runtime: Figures 2-7 show the runtimes of the TKCS and TSP algorithms for mining the top-k Closed Sequential Patterns (CSPs). According to these experimental results, TKCS is much faster than TSP on all the databases and for all tested values of k, especially when the user chooses a larger k and the sequence database is large and contains many items.
This work addresses the top-k CSP problem by raising the minsup value so that sequential-pattern mining algorithms generate exactly the number of CSPs desired by the user.
Summary
In the domain of data mining from a sequence database, mining sequential patterns is an essential task that has been extensively examined [1], [3], [4], [8]–[11], [14], [17], [23], [27], [35]. AprioriAll [1] was the first algorithm designed to solve the sequential pattern mining problem. It was proposed by Agrawal et al. in 1995 and is the basis for later algorithms such as GSP [27], SPADE [35], SPAM [3], FREESPAN [12], PREFIXSPAN [23], PRISM [11], and MCM-SPADE [14]. All of the algorithms mentioned above for mining sequential patterns or CSPs from a sequence database require the user to supply a minimum support threshold. Raising the minsup value to fit sequential-pattern mining algorithms means adjusting this parameter automatically so that exactly the number of CSPs desired by the user is generated.