Mining Sequential Patterns with VC-Dimension and Rademacher Complexity

Diego Santoro, Fabio Vandin, Andrea Tonon

doi:10.3390/a13050123

Abstract

Sequential pattern mining is a fundamental data mining task with applications in several domains. We study two variants of this task: the first is the extraction of frequent sequential patterns, whose frequency in a dataset of sequential transactions is higher than a user-provided threshold; the second is the mining of true frequent sequential patterns, which appear with probability above a user-defined threshold in transactions drawn from the generative process underlying the data. We present the first sampling-based algorithm to mine, with high confidence, a rigorous approximation of the frequent sequential patterns from massive datasets. We also present the first algorithms to mine approximations of the true frequent sequential patterns with rigorous guarantees on the quality of the output. Our algorithms are based on novel applications of Vapnik-Chervonenkis dimension and Rademacher complexity, advanced tools from statistical learning theory, to sequential pattern mining. Our extensive experimental evaluation shows that our algorithms provide high-quality approximations for both problems we consider.

Highlights

Sequential pattern mining [1] is a fundamental task in data mining and knowledge discovery, with applications in several fields, from recommender systems and e-commerce to biology and medicine
We studied two tasks related to sequential pattern mining: frequent sequential pattern mining and true frequent sequential pattern mining
We devised an efficient sampling-based algorithm to approximate the set of frequent sequential patterns in large datasets using the concept of VC-dimension

Summary

Introduction

Sequential pattern mining [1] is a fundamental task in data mining and knowledge discovery, with applications in several fields, from recommender systems and e-commerce to biology and medicine. The exact solution of the problem requires processing the entire dataset at least once, and often multiple times. A natural way to reduce the computation is to use sampling to obtain a small random portion (sample) of the dataset, and to perform the mining process only on the sample. By analyzing only a sample of the data, the problem cannot be solved exactly, and one has to rely on the approximation provided by the results of the mining task on the sample. The main challenge in using sampling is computing a sample size such that the frequency of the sequential patterns in the sample is close to the frequency that would be obtained from the analysis of the whole dataset. Relating the two quantities using standard techniques (e.g., Hoeffding's inequality and union bounds) does not provide useful results.

Algorithms 2020, 13, 123; doi:10.3390/a13050123; www.mdpi.com/journal/algorithms
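As a rough illustration of the sampling idea described in the Introduction, the following Python sketch computes the frequency of a sequential pattern on a dataset and on a random sample, together with the sample size obtained from the textbook combination of Hoeffding's inequality and a union bound over a finite family of patterns. It is not the authors' algorithm: the function names, the representation of a transaction as a list of itemsets, and the choice of sampling uniformly with replacement are all assumptions made for this example.

```python
import math
import random


def is_subsequence(pattern, transaction):
    """Return True if `pattern` is a subsequence of `transaction`.

    Both arguments are lists of sets (itemsets). Each itemset of the
    pattern must be contained in a distinct itemset of the transaction,
    with the order preserved. Greedy left-to-right matching suffices.
    """
    pos = 0
    for itemset in transaction:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def frequency(pattern, dataset):
    """Fraction of transactions in `dataset` that contain `pattern`."""
    return sum(is_subsequence(pattern, t) for t in dataset) / len(dataset)


def hoeffding_union_sample_size(num_patterns, eps, delta):
    """Sample size m such that 2 * num_patterns * exp(-2 * m * eps^2) <= delta.

    Standard Hoeffding + union bound (an assumption of this sketch, not
    the bound derived in the paper): with probability at least 1 - delta,
    every one of `num_patterns` patterns has a sample frequency within
    `eps` of its dataset frequency.
    """
    return math.ceil(math.log(2 * num_patterns / delta) / (2 * eps ** 2))


def sample_frequency(pattern, dataset, sample_size, rng):
    """Frequency of `pattern` in a uniform sample drawn with replacement."""
    sample = rng.choices(dataset, k=sample_size)
    return frequency(pattern, sample)


# Toy dataset: each transaction is an ordered list of itemsets.
dataset = [
    [{1, 3}, {4}, {2}],
    [{1}, {2, 5}],
    [{2}, {1}],
    [{3}, {4}],
]
pattern = [{1}, {2}]

exact = frequency(pattern, dataset)
m = hoeffding_union_sample_size(num_patterns=1, eps=0.1, delta=0.05)
approx = sample_frequency(pattern, dataset, m, random.Random(0))
```

In this sketch the sample size grows with the logarithm of the number of patterns covered by the union bound and with 1/eps^2, so it depends heavily on how many candidate patterns must be accounted for.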



Journal: Algorithms	Publication Date: May 18, 2020
Citations: 11	License type: CC BY 4.0


