Randomized Embeddings with Slack and High-Dimensional Approximate Nearest Neighbor

Evangelos Anagnostopoulos,Ioannis Psarros,Ioannis Z Emiris

doi:10.1145/3178540

Abstract

Approximate nearest neighbor search (ϵ-ANN) in high dimensions has been mainly addressed by Locality Sensitive Hashing (LSH), which has complexity with polynomial dependence in dimension, sublinear query time, but subquadratic space requirement. We introduce a new “low-quality” embedding for metric spaces requiring that, for some query, there exists an approximate nearest neighbor among the pre-images of its k > 1 approximate nearest neighbors in the target space. In Euclidean spaces, we employ random projections to a dimension inversely proportional to k . Our approach extends to the decision problem with witness of checking whether there exists an approximate near neighbor; this also implies a solution for ϵ-ANN. After dimension reduction, we store points in a uniform grid of side length ϵ /√ d ′ , where d ′ is the reduced dimension. Given a query, we explore cells intersecting the unit ball around the query. This data structure requires linear space and query time in O ( d n ρ ), ρ ≈ 1-ϵ 2 i>/log(1ϵ), where n denotes input cardinality and d space dimension. Bounds are improved for doubling subsets via r -nets. We present our implementation for ϵ-ANN in C++ and experiments for d ≤ 960, n ≤ 10 6 , using synthetic and real datasets, which confirm the theoretical analysis and, typically, yield better practical performance. We compare to FALCONN, the state-of-the-art implementation of multi-probe LSH: our prototype software is essentially comparable in terms of preprocessing, query time, and storage usage.

Full Text