Nearest Neighbors Can Be Found Efficiently If the Dimension Is Small Relative to the Input Size

Michiel Hagedoorn

doi:10.1007/3-540-36285-1_29

Abstract

We consider the problem of nearest-neighbor search for a set of n data points in d-dimensional Euclidean space. We propose a simple, practical data structure, which is essentially a directed acyclic graph in which each node has at most two outgoing arcs. We analyze the performance of this data structure for the setting in which the n data points are chosen independently from a d-dimensional ball under the uniform distribution. In the average case, for fixed dimension d, we achieve a query time of O(log² n) using only O(n) storage space. For variable dimension, both the query time and the storage space are multiplied by a dimension-dependent factor that is at most exponential in d. This is an improvement over previously known time-space tradeoffs, which all have a super-exponential factor of at least d^Θ(d) either in the query time or in the storage space. Our data structure can be stored efficiently in secondary memory: in a standard secondary-memory model, for fixed dimension d, we achieve average-case bounds of O((log² n)/B + log n) query time and O(N) storage space, where B is the block-size parameter and N = n/B. Our data structure is not limited to Euclidean space; its definition generalizes to all possible choices of query objects, data objects, and distance functions.
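
To make the problem setting concrete, the following is a minimal sketch of the input model and of a naive baseline, not of the data structure proposed in the paper. It draws points independently and uniformly from the unit ball in d dimensions and answers a nearest-neighbor query by a linear scan in O(nd) time. The function names `sample_ball` and `nearest_neighbor` are hypothetical, and the choice of the unit ball centered at the origin is an assumption made for illustration.

```python
import math
import random


def sample_ball(d, rng):
    """Draw one point uniformly from the unit ball in R^d (illustrative helper)."""
    # A normalized Gaussian vector gives a uniformly random direction.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    # Radius U^(1/d) makes the point uniform with respect to volume.
    r = rng.random() ** (1.0 / d)
    return tuple(r * x / norm for x in g)


def nearest_neighbor(points, query):
    """Linear-scan baseline: index of the data point closest to the query."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], query))


# Example usage: n = 1000 data points in dimension d = 3.
rng = random.Random(0)
points = [sample_ball(3, rng) for _ in range(1000)]
query = sample_ball(3, rng)
best = nearest_neighbor(points, query)
```

A linear scan like this serves only as a reference point: the bounds stated in the abstract concern a structure whose average-case query time is polylogarithmic in n for fixed d.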
