Pivot selection: Dimension reduction for distance-based indexing

Rui Mao,Willard L Miranker,Daniel P Miranker

doi:10.1016/j.jda.2011.10.004

Rui Mao, Willard L Miranker + Show 1 more

Open Access

https://doi.org/10.1016/j.jda.2011.10.004

Copy DOI

Abstract

Distance-based indexing exploits only the triangle inequality to answer similarity queries in metric spaces. Lacking coordinate structure, mathematical tools in Rn can only be applied indirectly, making it difficult to theoretically study metric-space indexing. Toward solving this problem, a common algorithmic step is to select a small number of special points, called pivots, and map the data objects to a low-dimensional space, one dimension for each pivot, where each dimension represents the distances of a pivot to the data objects. We formalize a “pivot space model” where all the data objects are used as pivots such that data is mapped from metric space to Rn, preserving all the pairwise distances under L∞. With this model, it can be shown that the indexing problem in metric space can be equivalently studied in Rn. Further, we show the necessity of dimension reduction for Rn and that the only effective form of dimension reduction is to select existing dimensions, i.e. pivot selection. The coordinate structure of Rn makes the application of many mathematical tools possible. In particular, Principle Component Analysis (PCA) is incorporated into a heuristic method for pivot selection and shown to be effective over a large range of workloads. We also show that PCA can be used to reliably measure the intrinsic dimension of a metric space.

Full Text