Abstract

Sequence pattern mining extracts the distinctive and representative features hidden in sequence data. It has recently attracted considerable attention, especially in bioinformatics and spatio-temporal trajectory mining. Because many sequence data are inherently uncertain, and because ever larger volumes of sequence data are being generated and accumulated, this paper aims to discover the hidden features in large amounts of uncertain sequence data. The Probabilistic Suffix Tree (PST), an implementation of the Variable-length Markov Chain (VMM), has been widely applied in sequence data mining. However, the conventional PST construction algorithm is not designed for uncertain data and does not scale to very large data sets. To mine large amounts of uncertain sequence data, this paper therefore proposes the uPST$_{MR}^{+}$ algorithm on the Hadoop platform, which exploits the computing power and storage capacity of cloud computing. uPST$_{MR}^{+}$ constructs a PST in a progressive, multi-layered, and iterative manner so as to avoid learning excessive patterns and to balance the overhead of distributed computing. In addition, because repeatedly scanning the entire sequence data would drag down overall performance, we trade space for time: a NodeArray data structure stores the intermediate statistical results and thereby reduces disk I/O. We conduct several experiments to verify the performance of uPST$_{MR}^{+}$. The results show that uPST$_{MR}^{+}$ significantly outperforms the naive approach and exhibits good scalability and stability. Moreover, although NodeArray costs a little extra memory, it significantly lowers the execution time.
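
To make the layer-by-layer idea concrete, the following is a minimal single-machine sketch in Python. It is not the uPST$_{MR}^{+}$ algorithm itself and involves no Hadoop code. It rests on several assumptions that the abstract does not state: each position of an uncertain sequence is modeled as an independent distribution over symbols, occurrence counts are replaced by expected counts, and a context is expanded only if its expected count reaches a hypothetical `min_count` threshold. The per-layer dictionary `layer_stats` only loosely mirrors the role of NodeArray: the statistics for one layer are gathered in a single pass and then reused both for the next-symbol probabilities and for choosing the contexts of the next layer. All function and parameter names are illustrative.

```python
from collections import defaultdict


def expected_counts(sequences, length):
    """Expected number of occurrences of every string of the given length.

    Each sequence is a list of positions; each position is a dict mapping
    symbol -> probability (positions are assumed to be independent).
    """
    counts = defaultdict(float)
    for seq in sequences:
        for start in range(len(seq) - length + 1):
            # All strings that can occur in this window, with their probability.
            partial = {"": 1.0}
            for pos in seq[start:start + length]:
                partial = {s + sym: p * q
                           for s, p in partial.items()
                           for sym, q in pos.items()}
            for s, p in partial.items():
                counts[s] += p
    return counts


def build_pst(sequences, max_depth, min_count):
    """Build a context -> next-symbol distribution map, one layer at a time."""
    tree = {}
    frontier = [""]  # layer 0 holds only the root (empty context)
    for depth in range(max_depth + 1):
        # One pass over the data per layer; the result is reused twice below.
        layer_stats = expected_counts(sequences, depth + 1)
        for ctx in frontier:
            following = {s[-1]: c for s, c in layer_stats.items()
                         if s[:-1] == ctx}
            total = sum(following.values())
            if total > 0:
                tree[ctx] = {sym: c / total for sym, c in following.items()}
        # Candidate contexts for the next layer: extend kept contexts by one
        # symbol to the left and keep those that are frequent enough.
        frontier = [s for s, c in layer_stats.items()
                    if s[1:] in tree and c >= min_count]
    return tree


# A certain sequence "abab" expressed in the uncertain format.
certain = [[{"a": 1.0}, {"b": 1.0}, {"a": 1.0}, {"b": 1.0}]]
pst = build_pst(certain, max_depth=1, min_count=1.0)
```

In this sketch the stopping rule is simply the depth limit plus the expected-count threshold; a full PST learner would typically apply further pruning criteria, which are omitted here.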
