Abstract

The top-k similarity join of time series, which finds the k most similar pairs of time series records, is a primitive operation used by many time series data analysis applications. Computing such a join is challenging, however, because many modern applications generate massive amounts of time series data, and it is difficult for a single centralized machine to perform a top-k similarity join over a large time series database efficiently. In this paper, we investigate how to perform the top-k similarity join of massive time series in parallel using MapReduce over a large cluster of commodity machines. Our proposed MapReduce-based algorithm consists of four steps; it takes as input a set of time series records and outputs an ordered list of the top-k closest pairs. To improve the efficiency of computing the top-k similarity join, we propose several solutions. We first introduce an efficient distance function for time series based on locality-sensitive hashing (LSH) to speed up pairwise similarity comparison. We next propose all-pair partitioning methods to minimize the amount of data transferred between the map and reduce functions. Moreover, we use a serial computation strategy to parallelize the computation of the local top-k closest pairs in each partition. Our performance study confirms the effectiveness and scalability of our MapReduce algorithms.
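The overall pattern of computing local top-k closest pairs per partition and then merging them can be illustrated with a minimal single-machine sketch. It is not the paper's four-step algorithm: plain Euclidean distance stands in for the LSH-based distance function, a round-robin split of the candidate pairs stands in for the all-pair partitioning methods, and the names `local_top_k`, `merge_top_k`, and `all_pairs_top_k` are hypothetical.

```python
import heapq
import math
from itertools import combinations


def euclidean(a, b):
    # Stand-in distance (assumption); the paper uses an LSH-based distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_top_k(pairs, k, dist=euclidean):
    """Return the k closest pairs in one partition as sorted (distance, id_a, id_b)."""
    heap = []  # bounded max-heap, implemented by negating the distance
    for (id_a, series_a), (id_b, series_b) in pairs:
        d = dist(series_a, series_b)
        if len(heap) < k:
            heapq.heappush(heap, (-d, id_a, id_b))
        elif d < -heap[0][0]:
            # Closer than the current k-th closest pair: replace the worst one.
            heapq.heapreplace(heap, (-d, id_a, id_b))
    return sorted((-neg_d, id_a, id_b) for neg_d, id_a, id_b in heap)


def merge_top_k(partials, k):
    """Merge per-partition results into the global top-k (the 'reduce' side)."""
    return heapq.nsmallest(k, (item for part in partials for item in part))


def all_pairs_top_k(records, k, num_partitions=2):
    # Round-robin split of candidate pairs (assumption, for illustration only).
    pairs = list(combinations(records, 2))
    partitions = [pairs[i::num_partitions] for i in range(num_partitions)]
    partials = [local_top_k(p, k) for p in partitions]  # the 'map' side
    return merge_top_k(partials, k)


records = [("a", [0, 0]), ("b", [0, 1]), ("c", [5, 5]), ("d", [5, 5.5])]
result = all_pairs_top_k(records, k=2)
```

Because every candidate pair lands in exactly one partition, and each partition keeps its own k best pairs, the merged list contains the global top-k regardless of how the pairs are split.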
