Data series are a significant data form in many applications. It is therefore imperative to devise a data series index that supports both approximate and exact similarity search over large data series collections in high-dimensional metric spaces. State-of-the-art works employ summarizations and indices to reduce the number of accesses to the data series. However, we identify two significant flaws that severely limit their performance. First, state-of-the-art works often employ segment-based summarizations, whose lower-bound distances decrease significantly when they represent a collection of data series, resulting in numerous invalid accesses. Second, disk-based indices for exact search rely mainly on tree-based structures, which yield low-quality approximate answers and consequently degrade exact search. To address these problems, we propose a novel solution, Double Indices and Double Summarizations (DIDS). In addition to segment-based summarizations, DIDS introduces reference-point-based summarizations, which improve the pruning rate through a sorted-based representation strategy. Moreover, DIDS employs reference points and a cost model to cluster similar data series, and uses a graph-based approach to interconnect different regions, enhancing approximate search capabilities. We conduct experiments on extensive datasets, which validate the superior search performance of DIDS.
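To illustrate why reference points can help pruning in a metric space, the following is a minimal sketch of the generic pivot technique based on the triangle inequality. It is not the DIDS algorithm itself: the function names (`summarize`, `lower_bound`, `exact_nn`), the use of Euclidean distance, and the choice to visit candidates in order of increasing lower bound are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def summarize(series, pivots):
    """Hypothetical summarization: distances from a series to each reference point."""
    return [euclidean(series, p) for p in pivots]


def lower_bound(query_summary, series_summary):
    """Triangle inequality: |d(q, p) - d(x, p)| <= d(q, x) for every pivot p,
    so the maximum over pivots is a valid lower bound on d(q, x)."""
    return max(abs(q - s) for q, s in zip(query_summary, series_summary))


def exact_nn(query, collection, pivots):
    """Exact 1-NN search that skips series whose lower bound cannot beat the best so far.
    Returns (index, distance, number of raw series accessed)."""
    # In a real index these summaries would be precomputed and stored.
    summaries = [summarize(s, pivots) for s in collection]
    q_sum = summarize(query, pivots)
    bounds = [lower_bound(q_sum, s) for s in summaries]
    # Assumption: visit candidates in order of increasing lower bound.
    order = sorted(range(len(collection)), key=lambda i: bounds[i])

    best_i, best_d, accesses = -1, math.inf, 0
    for i in order:
        if bounds[i] >= best_d:
            break  # every remaining candidate has an even larger lower bound
        accesses += 1
        d = euclidean(query, collection[i])
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d, accesses
```

The tighter the lower bound, the earlier the loop can stop, which is the sense in which a better summarization reduces accesses to the raw data series.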