Mining high average utility nonoverlapping patterns from sequential database
As a crucial aspect of data mining, high average utility sequential pattern mining (SPM) aims to discover low frequency and high average utility patterns (subsequences) in sequence data. Most existing high average utility SPM methods overlook the repetitive occurrences of patterns in each sequence, so some important patterns are missed. To address this issue, we focus on the problem of mining high average utility nonoverlapping patterns (HUPs) from sequential databases, and propose the HUP-Miner algorithm. To avoid repeated scans of the original database, we use a position dictionary to record the occurrence information of each item. To reduce the number of candidate patterns generated, we adopt a pattern join strategy and explore four pruning strategies. To efficiently calculate the average utility of a pattern, we propose an SPC algorithm that utilizes the occurrence positions of sub-patterns. Experimental results on 14 databases show that HUP-Miner outperforms 12 competitive algorithms. Furthermore, we use information gain as the utility of each item, and a clustering analysis shows that the HUPs discovered in this way yield better performance. All of the algorithms and databases used here are available from https://github.com/wuc567/Pattern-Mining/tree/master/HUP-Miner.
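The position dictionary mentioned in this abstract can be pictured with a minimal sketch. The layout below (item, then sequence index, then a list of positions) and the function name `build_position_dictionary` are assumptions made for illustration; the abstract does not specify the structure HUP-Miner actually uses.

```python
from collections import defaultdict

def build_position_dictionary(sequences):
    """Map each item to {sequence index: [positions where the item occurs]}.

    Hypothetical layout: one pass over the database records every
    occurrence, so later steps can look positions up without rescanning.
    """
    positions = defaultdict(lambda: defaultdict(list))
    for seq_id, sequence in enumerate(sequences):
        for pos, item in enumerate(sequence):
            positions[item][seq_id].append(pos)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {item: dict(by_seq) for item, by_seq in positions.items()}

# Toy database of two sequences (illustrative data only).
sequences = ["abab", "bca"]
position_dict = build_position_dictionary(sequences)
print(position_dict["a"])
```

With such a structure, the occurrences of a longer pattern could be derived from the stored positions of its items rather than from another scan of the raw sequences.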
- Research Article
2
- 10.3390/app132212340
- Nov 15, 2023
- Applied Sciences
High-utility sequential pattern mining (HUSPM) helps researchers find all subsequences that have high utility in a quantitative sequential database. The HUSPM approach appears to be well suited for resource transformation in DIKWP graphs. However, all the extensions of a high-utility sequential pattern (HUSP) also have a high utility that increases with their length. Therefore, it is difficult to obtain diverse patterns of resources. Patterns that consist of many low-utility items can also be HUSPs. In practice, such long patterns are difficult to analyze. In addition, the low-utility items do not always reflect the interestingness of association rules. High average-utility pattern mining is considered a solution that extracts more significant patterns by taking pattern length into account. In this paper, we formulate the problem of top-k high average-utility sequential pattern mining (HAUSPM) and propose a novel algorithm for resource transformation. We adopt a projection mechanism to improve efficiency. We also adopt the sequence average-utility-raising strategy to increase thresholds. We design the prefix extension average utility and the reduced sequence average utility by incorporating the average utility into the utility upper bounds. The results of our comparative experiments demonstrate that the proposed algorithm can achieve sufficiently good performance.
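The motivation for average utility can be illustrated with a small sketch. It assumes the common definition in which the average utility of one pattern occurrence is the sum of its item utilities divided by the number of items; the function names and the numbers are hypothetical and are not taken from the paper.

```python
def utility(item_utilities):
    """Total utility of one pattern occurrence: the sum over its items."""
    return sum(item_utilities)

def average_utility(item_utilities):
    """Utility divided by pattern length (number of items)."""
    if not item_utilities:
        raise ValueError("a pattern must contain at least one item")
    return sum(item_utilities) / len(item_utilities)

# Illustrative utilities: a two-item pattern, then the same pattern
# extended with one low-utility item.
short_pattern = [10, 8]
extended_pattern = [10, 8, 1]

print(utility(short_pattern), utility(extended_pattern))
print(average_utility(short_pattern), average_utility(extended_pattern))
```

Under this assumption the extension raises the total utility but lowers the average utility, which is why an average-utility threshold tends to filter out long patterns padded with low-utility items.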
- Research Article
99
- 10.1109/tcyb.2020.2970176
- Feb 28, 2020
- IEEE Transactions on Cybernetics
High-utility sequential pattern (HUSP) mining is an emerging topic in the field of knowledge discovery in databases. It consists of discovering subsequences that have a high utility (importance) in sequences, which can be referred to as HUSPs. HUSPs can be applied to many real-life applications, such as market basket analysis, e-commerce recommendations, click-stream analysis, and route planning. Several algorithms have been proposed to efficiently mine utility-based useful sequential patterns. However, due to the combinatorial explosion of the search space for low utility threshold and large-scale data, the performances of these algorithms are unsatisfactory in terms of runtime and memory usage. Hence, this article proposes an efficient algorithm for the task of HUSP mining, called HUSP mining with UL-list (HUSP-ULL). It utilizes a lexicographic q-sequence (LQS)-tree and a utility-linked (UL)-list structure to quickly discover HUSPs. Furthermore, two pruning strategies are introduced in HUSP-ULL to obtain tight upper bounds on the utility of the candidate sequences and reduce the search space by pruning unpromising candidates early. Substantial experiments on both real-life and synthetic datasets showed that HUSP-ULL can effectively and efficiently discover the complete set of HUSPs and that it outperforms the state-of-the-art algorithms.
- Research Article
- 10.1371/journal.pone.0283365
- Mar 29, 2023
- PLOS ONE
High utility sequential pattern (HUSP) mining aims to mine actionable patterns with high utilities and is widely applied in real-world learning scenarios such as market basket analysis, scenic route planning and click-stream analysis. The existing HUSP mining algorithms mainly attempt to improve computation efficiency while maintaining the algorithm stability in the setting of large-scale data. Although these methods have made some progress, they ignore the relationship between additional items and underlying sequences, which directly leads to the generation of redundant sequential patterns sharing the same underlying sequence. Hence, the mined patterns' actionability is limited, which significantly compromises the performance of patterns in real-world applications. To address this problem, we present a new method named Combined Utility-Association Sequential Pattern Mining (CUASPM) by incorporating item/sequence relations, which can effectively remove redundant patterns and extract highly discriminative and strongly associated sequential pattern combinations with high utilities. Specifically, we introduce the concept of actionable combined mining into HUSP mining for the first time and develop a novel tree structure to select discriminative high utility sequential patterns (HUSPs) for downstream tasks. Furthermore, two efficient strategies (i.e., global and local strategies) are presented to facilitate mining HUSPs while guaranteeing utility growth and high levels of association. Last, two parameters are introduced to evaluate the interestingness of patterns to choose the most useful actionable combined HUSPs (ACHUSPs). Extensive experimental results demonstrate that the proposed CUASPM outperforms the baselines in terms of execution time, memory usage, and the mining of highly discriminative and strongly associated HUSPs.
- Research Article
52
- 10.1016/j.ins.2020.07.043
- Jul 19, 2020
- Information Sciences
Efficient list based mining of high average utility patterns with maximum average pruning strategies
- Research Article
29
- 10.1016/j.eswa.2018.03.019
- Mar 12, 2018
- Expert Systems with Applications
A pure array structure and parallel strategy for high-utility sequential pattern mining
- Research Article
1
- 10.3233/jifs-232107
- Nov 4, 2023
- Journal of Intelligent & Fuzzy Systems
In recent years, there has been an increasing demand for high utility sequential pattern (HUSP) mining. Unlike high utility itemset mining, HUSP mining faces the “combinatorial explosion” problem of sequence data, which makes it more challenging. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art HUSP mining methods from a novel perspective. First, from the serial and parallel perspectives, the data structures used by the mining methods are illustrated and the pros and cons of the algorithms are summarized. To protect data privacy, many HUSP hiding algorithms have been proposed, which are classified into array-based, chain-based and matrix-based algorithms according to their key technologies. The hiding strategies and evaluation metrics adopted by the algorithms are summarized. Next, a taxonomy of the most common and state-of-the-art incremental mining algorithms is presented, covering tree-based and projection-based approaches. To deal with the latest sequences in a data stream, existing algorithms often use a window model to update dynamically, and the algorithms are divided into sliding-window and landmark-window methods for analysis. Afterwards, a summary of derived high utility sequential patterns is presented. Finally, in view of the deficiencies of existing HUSP research, the authors' planned future work is outlined.
- Research Article
27
- 10.1007/s10994-016-5617-1
- Feb 2, 2017
- Machine Learning
High utility sequential pattern (HUSP) mining has emerged as an important topic in data mining. A number of studies have been conducted on mining HUSPs, but they are mainly intended for non-streaming data and thus do not take data stream characteristics into consideration. Streaming data are fast changing, continuously generated, and unbounded in quantity. Such data can easily exhaust computer resources (e.g., memory) unless proper resource-aware mining is performed. In this study, we explore the fundamental problem of how limited memory can be best utilized to produce high quality HUSPs over a data stream. We design an approximation algorithm, called MAHUSP, that employs memory adaptive mechanisms to use a bounded portion of memory, in order to efficiently discover HUSPs over data streams. An efficient tree structure, called MAS-Tree, is proposed to store potential HUSPs over a data stream. MAHUSP guarantees that all HUSPs are discovered in certain circumstances. Our experimental study shows that our algorithm can not only discover HUSPs over data streams efficiently, but also adapt to memory allocation with limited sacrifices in the quality of discovered HUSPs. Furthermore, in order to show the effectiveness and efficiency of MAHUSP in real-life applications, we apply our proposed algorithm to a web clickstream dataset obtained from a Canadian news portal to showcase users' reading behavior, and to a real biosequence database to identify disease-related gene regulation sequential patterns. The results show that MAHUSP effectively discovers useful and meaningful patterns in both cases.
- Conference Article
85
- 10.1109/icdm.2013.148
- Dec 1, 2013
High utility sequential pattern mining is an emerging topic in the data mining community. Compared to the classic frequent sequence mining, the utility framework provides more informative and actionable knowledge since the utility of a sequence indicates business value and impact. However, the introduction of "utility" makes the problem fundamentally different from the frequency-based pattern mining framework and brings about dramatic challenges. Although the existing high utility sequential pattern mining algorithms can discover all the patterns satisfying a given minimum utility, it is often difficult for users to set a proper minimum utility. Too small a value may produce thousands of patterns, whereas too large a value may lead to no findings. In this paper, we propose a novel framework called top-k high utility sequential pattern mining to tackle this critical problem. Accordingly, an efficient algorithm, Top-k high Utility Sequence (TUS for short) mining, is designed to identify top-k high utility sequential patterns without a minimum utility. In addition, three effective features are introduced to handle the efficiency problem, including two strategies for raising the threshold and one pruning strategy for filtering unpromising items. Our experiments are conducted on both synthetic and real datasets. The results show that TUS incorporating the efficiency-enhanced strategies demonstrates impressive performance without missing any high utility sequential patterns.
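The idea of mining without a user-supplied minimum utility can be sketched with a generic top-k buffer. This is not the TUS algorithm or either of its threshold-raising strategies; it is the standard min-heap technique, shown under the assumption that utilities are non-negative. The class name `TopKThreshold` and the candidate data are hypothetical.

```python
import heapq

class TopKThreshold:
    """Keep the k best (utility, pattern) pairs and expose a rising threshold."""

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self._heap = []  # min-heap: the smallest kept utility is at index 0

    @property
    def threshold(self):
        # Until k patterns are known, nothing can be pruned (assumes utilities >= 0).
        return self._heap[0][0] if len(self._heap) == self.k else 0

    def offer(self, pattern, utility):
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (utility, pattern))
        elif utility > self._heap[0][0]:
            # Evict the weakest kept pattern; the threshold rises as a result.
            heapq.heapreplace(self._heap, (utility, pattern))

    def results(self):
        return sorted(self._heap, reverse=True)

# Illustrative candidates with made-up utilities.
top2 = TopKThreshold(k=2)
for pattern, u in [("a", 5), ("ab", 12), ("b", 7), ("abc", 3), ("bc", 9)]:
    top2.offer(pattern, u)

print(top2.threshold)
print(top2.results())
```

In a real miner the current threshold would be compared against an upper bound on a candidate's utility so that unpromising branches are skipped early; that part is omitted here.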
- Conference Article
3
- 10.1109/ispa-bdcloud-sustaincom-socialcom48970.2019.00132
- Dec 1, 2019
High utility sequential pattern mining (HUSPM) is an emerging topic in data mining. Compared with earlier topics (sequential pattern mining and high utility itemset mining), HUSPM can provide more applicable knowledge, because it jointly considers utility, which indicates business value, and sequential order, which indicates the causality between different items. However, combining utility with sequential order brings dramatic challenges and makes HUSPM more difficult than the earlier problems. In this paper, we propose two efficient algorithms, HUS-UT and HUS-Par, for HUSPM. The proposed HUS-UT algorithm adopts a novel data structure named Utility-Table to facilitate the utility calculation, so it can find the desired patterns quickly. The HUS-Par algorithm is a parallel version of HUS-UT based on the thread model, which also exploits two balance strategies to improve efficiency. We also conduct substantial experiments to evaluate the performance of our algorithms. The experimental results show that our algorithms are much faster than the state-of-the-art algorithms.
- Research Article
- 10.17762/msea.v70i1.2304
- Jan 31, 2021
- Mathematical Statistician and Engineering Applications
Text mining applied to texts and publications in the biomedical and molecular biology fields is referred to as "biomedical text mining." It is a relatively new area of study at the intersection of computational linguistics, bioinformatics, and natural language processing. The goal of sequential pattern mining is to identify statistically significant patterns among data instances whose values are presented sequentially. Because the values are typically assumed to be discrete, time series mining is closely related but usually regarded as a distinct activity. Sequential pattern mining is a special case of structured data mining. High utility pattern (HUP) mining is one of the most relevant study areas in data mining today, since it can take into account the nonbinary frequency values of items in transactions as well as different profit values for each item. Moreover, reusing previous data structures and mining results enables incremental and interactive data mining to avoid further calculations when a database is updated or the minimum threshold is modified. This study proposes three novel tree structures for effective incremental and interactive HUP mining. Key concepts and elements of the high utility sequential pattern mining problem are also formalised.
- Research Article
21
- 10.1016/j.protcy.2012.10.053
- Jan 1, 2012
- Procedia Technology
Efficiently Mining of Effective Web Traversal Patterns with Average Utility
- Research Article
34
- 10.1145/3178114
- Jun 1, 2018
- ACM Transactions on Intelligent Systems and Technology
High utility sequential pattern (HUSP) mining is an emerging topic in pattern mining, and only a few algorithms have been proposed to address it. In practice, most sequence databases usually grow over time, and it is inefficient for existing algorithms to mine HUSPs from scratch when databases grow with a small portion of updates. In view of this, we propose the IncUSP-Miner+ algorithm to mine HUSPs incrementally. Specifically, to avoid redundant re-computations, we propose a tighter upper bound of the utility of a sequence, called Tight Sequence Utility (TSU), and then we design a novel data structure, called the candidate pattern tree, to buffer the sequences whose TSU values are greater than or equal to the minimum utility threshold in the original database. Accordingly, to avoid keeping a huge amount of utility information for each sequence, a set of concise utility information is designed to be stored in each tree node. To improve the mining efficiency, several strategies are proposed to reduce the amount of computation for utility update and the scopes of database scans. Moreover, several strategies are also proposed to properly adjust the candidate pattern tree for the support of multiple database updates. Experimental results on some real and synthetic datasets show that IncUSP-Miner+ is able to efficiently mine HUSPs incrementally.
- Book Chapter
8
- 10.1007/978-94-017-8798-7_7
- Jan 1, 2014
High utility sequential pattern mining aims to mine sequences with high utility (e.g. profits) but possibly low frequency. In some applications such as marketing analysis, high utility sequential patterns are usually more useful than sequential patterns with high frequency. In this paper, we devise two pruning strategies, RSU and PDU, and propose the HUS-Span algorithm based on these two pruning strategies to efficiently identify high utility sequential patterns. Experimental results show that the HUS-Span algorithm outperforms prior algorithms by pruning more low utility sequences.
- Research Article
5
- 10.1142/s0218001418590176
- Jun 20, 2018
- International Journal of Pattern Recognition and Artificial Intelligence
High utility sequential patterns (HUSP) mining has recently received a lot of attention from researchers. Many algorithms have been proposed to mine HUSP and most of them only use a single minimum utility, which implicitly assumes that all items in the database are of the same importance (such as profit), or other information based on users’ concern in the database. This is often not the case in real-life applications. Although a few methods have been proposed to mine high utility itemsets (HUI) with multiple minimum utility (MMU), they are not suitable for mining HUSP with MMU because an item may occur more than one time in a sequence and may have multiple utility values. In this paper, we propose a novel method, called HUSpan-MMU, to efficiently mine HUSP with MMU from sequential utility-based databases. A lexicographic quantitative sequence tree (LQS-tree) is used to extract the complete set of HUSP. Meanwhile, two pruning methods are used to reduce the search space in the LQS-tree. Experimental results on both synthetic and real datasets show that HUSpan-MMU can efficiently mine HUSP with MMU from utility-based databases.
- Conference Article
1
- 10.1109/smc.2013.500
- Oct 1, 2013
Sequential pattern mining finds the sequential purchasing behaviors shared by most customers, and considers only the number of customers who exhibit each purchasing behavior in a customer transaction database. High utility sequential pattern mining considers both the profits and the purchased quantities of the items, in order to find the sequential patterns that bring high benefits to the business. Previous research on mining high utility sequential patterns defined the utility that a customer contributes to a sequence only roughly, so the generated patterns are not truly high utility. Moreover, previous approaches need to generate a large number of candidates and scan the whole database to calculate the utilities of all the generated candidates. Therefore, in this paper, we consider the actual purchasing behaviors of the customers and define high utility sequential patterns exactly. We also propose an efficient algorithm for mining these well-defined high utility sequential patterns, which significantly reduces the number of candidates. The experimental results show that our algorithm significantly outperforms the previous approach for mining high utility sequential patterns.