Abstract

The gram-based vector space model has been extensively applied to categorical sequence clustering. However, efficient methods are generally lacking for determining the length of grams and for identifying the redundant and non-significant grams involved in the model. In this paper, a variable-length gram model is proposed, in contrast to previous studies, which focused mainly on fixed-length grams of sequences. The variable-length grams are obtained using a two-stage pruning method that selects the irredundant and significant subsequences from prefix trees, which are created from fixed-length grams with an initially large length. A robust partitioning algorithm is then defined for categorical sequence clustering on the normalized representation model, using the variable-length grams collected from the pruned trees. Experimental results on real-world sequence sets from various domains are given to demonstrate the performance of the proposed methods.
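
As a rough illustration of the starting point described above, the following minimal sketch builds a fixed-length gram representation of a categorical sequence and the prefix tree from which shorter grams could later be selected. It is not the paper's method: the function names, the nested-dict tree layout, and the choice of L2 normalization are assumptions made for illustration, and the two-stage pruning criteria are not specified in the abstract, so they are not implemented here.

```python
from collections import Counter
import math


def gram_counts(sequence, n):
    """Count every contiguous length-n gram in a sequence of symbols."""
    return Counter(
        tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)
    )


def normalized_vector(sequence, n):
    """Map a sequence to a gram-frequency vector scaled to unit L2 norm.

    L2 normalization is an assumption; the abstract only says "normalized".
    """
    counts = gram_counts(sequence, n)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {g: c / norm for g, c in counts.items()} if norm else {}


def build_prefix_tree(sequence, n):
    """Insert all length-n grams into a nested-dict prefix tree.

    Each node stores how many length-n grams pass through it, so a node at
    depth k describes a candidate gram of length k (hypothetical layout).
    """
    root = {"count": 0, "children": {}}
    for i in range(len(sequence) - n + 1):
        node = root
        for symbol in sequence[i:i + n]:
            node = node["children"].setdefault(
                symbol, {"count": 0, "children": {}}
            )
            node["count"] += 1
    return root


if __name__ == "__main__":
    seq = "abab"
    print(gram_counts(seq, 2))
    print(normalized_vector(seq, 2))
    print(build_prefix_tree(seq, 2))
```

In this layout, a pruning step would walk the tree and discard nodes judged redundant or non-significant, leaving paths of different depths that serve as variable-length grams. How those judgments are made is defined in the full paper, not here.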
