In this manuscript, we develop clustering and classification algorithms for Context trees arising from Variable Length Markov Chains (VLMCs). The Context is defined as the finite suffix of the past that is sufficient to predict the next state of the chain. Identifying the relevant Contexts through the VLMC fitting procedure allows them to have different lengths depending on the past itself, and the resulting set of Contexts can be described by a rooted tree. This type of parsimonious model relaxes the assumptions of higher-order Markov Chains, whose number of parameters increases exponentially with the order of the chain. Dissimilarity measures that consider both the VLMC tree structure and the transition probability distributions of the Contexts are derived and integrated into the procedures. Through simulations in a variety of scenarios, the proposed algorithms are shown to outperform classical competitors in both classification and clustering, especially as the sample size of the state sequences increases. Two applications to real datasets are presented. In the first application, we develop clustering and classification methods for written texts according to their rhythmic patterns. We introduce a new retrieval process for the rhythm of texts written in English, which encodes the morphological structure of sentences using phonological words as building blocks together with the position of stressed syllables. Sequences of syllables in a text are modelled with a stochastic process in which the choice of lexical items depends on the rhythmic characteristics of the preceding words. In the second application, we perform unsupervised clustering on click-stream data of users from an online maternity clothing store. The browsing behaviours of users from different countries can be leveraged to optimize targeted marketing strategies, and the constructed VLMCs are used to rank the weblinks of the website based on their stationary distributions.
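To make the definition of a Context concrete, the following is a minimal sketch of how the Context of a given past could be looked up. It is an illustration only, not the fitting procedure or the algorithms of the manuscript: the toy binary alphabet, the three Contexts, their transition probabilities, and the names `CONTEXTS` and `find_context` are all assumptions introduced here.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical toy VLMC on the alphabet {0, 1}; not taken from the manuscript.
# Each key is a Context written oldest symbol first, each value is the
# distribution of the next state given that Context.
# No Context is a suffix of another, so they are the leaves of a rooted tree.
CONTEXTS: Dict[Tuple[int, ...], Dict[int, float]] = {
    (1,): {0: 0.6, 1: 0.4},      # last symbol 1: one step of memory suffices
    (0, 0): {0: 0.2, 1: 0.8},    # last symbol 0: one more step is needed
    (1, 0): {0: 0.7, 1: 0.3},
}


def find_context(
    past: Sequence[int],
    contexts: Dict[Tuple[int, ...], Dict[int, float]],
) -> Tuple[int, ...]:
    """Return the suffix of `past` that is a Context.

    Growing the suffix one symbol at a time corresponds to walking down
    the Context tree from the root, reading the past backwards.
    """
    for k in range(1, len(past) + 1):
        suffix = tuple(past[-k:])
        if suffix in contexts:
            return suffix
    raise ValueError("past is too short to determine a Context")


def next_state_distribution(
    past: Sequence[int],
    contexts: Dict[Tuple[int, ...], Dict[int, float]],
) -> Dict[int, float]:
    """Distribution of the next state, which depends on the past only via its Context."""
    return contexts[find_context(past, contexts)]
```

In this toy model a past ending in 1 needs only one symbol of memory, while a past ending in 0 needs two, which is the variable-length behaviour described above. A full second-order chain on the same alphabet would need four Contexts instead of three.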