An Explicit Relationship Between Sequential Patterns and Their Concise Representations

Hai Duong, Tin Truong, Philippe Fournier-Viger, Bac Le

doi:10.1007/978-3-030-37188-3_20

Abstract

Mining sequential patterns in a sequence database (SDB) is an important and useful data mining task. Most existing algorithms for this task directly mine the set \( \mathcal{FS} \) of all frequent sequences in an SDB. However, these algorithms often perform poorly on large SDBs because of the enormous search space and the cardinality of \( \mathcal{FS} \). In addition, constraint-based mining algorithms relying on this approach must read the SDB again whenever the user changes a constraint. To address these issues, this paper proposes a novel approach for generating \( \mathcal{FS} \) from the set of frequent closed sequences \( (\mathcal{FCS}) \) and the set of frequent generator sequences \( (\mathcal{FGS}) \), which are concise representations of \( \mathcal{FS} \). The proposed approach is based on a novel explicit relationship between \( \mathcal{FS} \) and these two sets. This relationship is the theoretical basis for a novel efficient algorithm named GFS-CR, which directly enumerates \( \mathcal{FS} \) from \( \mathcal{FCS} \) and \( \mathcal{FGS} \) rather than mining it from an SDB. Experimental results show that GFS-CR outperforms state-of-the-art algorithms in terms of runtime and scalability.
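To make the three sets concrete, the following minimal sketch computes them by brute force on a toy database. It is not the GFS-CR algorithm and makes several simplifying assumptions that do not come from the paper: each event is a single item rather than an itemset, the empty sequence is ignored, and the usual textbook definitions are used (a frequent sequence is closed if no proper super-sequence has the same support, and is a generator if no proper sub-sequence has the same support). The function names and the toy data are hypothetical.

```python
from itertools import combinations

def is_subsequence(a, b):
    """True if tuple a occurs in tuple b in the same order (gaps allowed)."""
    it = iter(b)
    return all(x in it for x in a)

def support(pattern, sdb):
    """Number of input sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in sdb)

def frequent_sequences(sdb, minsup):
    """Brute force: test every non-empty subsequence of every input sequence."""
    candidates = set()
    for s in sdb:
        for k in range(1, len(s) + 1):
            for idx in combinations(range(len(s)), k):
                candidates.add(tuple(s[i] for i in idx))
    supports = {p: support(p, sdb) for p in candidates}
    return {p: sup for p, sup in supports.items() if sup >= minsup}

def closed_and_generators(fs):
    """Split out closed sequences and generators from a {pattern: support} map."""
    closed = {
        p for p in fs
        if not any(q != p and fs[q] == fs[p] and is_subsequence(p, q) for q in fs)
    }
    generators = {
        p for p in fs
        if not any(q != p and fs[q] == fs[p] and is_subsequence(q, p) for q in fs)
    }
    return closed, generators

# Toy database (assumed for illustration): four sequences of single items.
sdb = [("a", "b", "c"), ("a", "b", "c"), ("a", "c"), ("b",)]
fs = frequent_sequences(sdb, minsup=2)
fcs, fgs = closed_and_generators(fs)

print(len(fs), "frequent sequences")
print("closed:    ", sorted(fcs))
print("generators:", sorted(fgs))
```

On this toy input, both the closed set and the generator set are smaller than the full set of frequent sequences, which is what makes them attractive as concise representations.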

Full Text