Abstract

Locality Sensitive Hashing (LSH) is one of the most efficient approaches to the nearest neighbor search problem in high-dimensional spaces. A family $H$ of hash functions is called locality sensitive if, for a hash function $h$ drawn at random from $H$, the collision probability $p_h(r)$ of any two points $\langle q, p \rangle$ at distance $r$ decreases with $r$. The classic LSH algorithm employs a data structure consisting of $k \cdot \ell$ randomly chosen hash functions to achieve more desirable collision curves, and the collision probability for $\langle q, p \rangle$ is $P_{h_k^\ell}(r) = 1 - (1 - p_h(r)^k)^\ell$. The great success of LSH is usually attributed to the solid theoretical guarantees for $P_{h_k^\ell}(r)$ and $p_h(r)$. In practice, however, users are more interested in the recall rate, i.e., the probability that a random query collides with its $r$-near neighbor over a fixed LSH data structure $h_k^\ell$. Implicitly or explicitly, $P_{h_k^\ell}(r)$ is often misinterpreted as the recall rate and used to predict the performance of LSH. This is problematic because $P_{h_k^\ell}(r)$ is actually the expectation of the recall rate over the random choice of the data structure. Interestingly, numerous empirical studies show that, for most (if not all) real datasets and a fixed sample of a random LSH data structure, the recall rate is very close to $P_{h_k^\ell}(r)$. In this paper, we provide a theoretical justification for this phenomenon. We show that (1) for random datasets the recall rate is asymptotically equal to $P_{h_k^\ell}(r)$; and (2) for arbitrary datasets the variance of the recall rate is very small as long as the parameters $k$ and $\ell$ are properly chosen and the dataset is large enough. Our analysis (1) explains why the practical performance of LSH (the recall rate) matches the theoretical expectation $P_{h_k^\ell}(r)$ so well; and (2) indicates that, in addition to the theoretical guarantee, the mechanism by which LSH data structures are constructed and the large amount of data are also main causes of the success of LSH in practice.
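
To make the distinction between the expected collision probability and the recall rate concrete, the following is a minimal Python sketch, not the paper's method. It assumes bit-sampling LSH over binary vectors under Hamming distance, where a single hash function returns one randomly chosen coordinate and so collides with probability 1 - r/d for two points at distance r in dimension d. The function names, the parameters, and the choice of hash family are illustrative assumptions. The sketch fixes one random data structure of ℓ tables with k sampled coordinates each, then measures how often random query/neighbor pairs at distance r collide in at least one table.

```python
import random


def amplified_collision_probability(p, k, l):
    """Expected collision probability 1 - (1 - p^k)^l for k*l hash functions."""
    return 1.0 - (1.0 - p ** k) ** l


def empirical_recall(d, r, k, l, n_pairs, seed=0):
    """Recall of ONE fixed bit-sampling LSH structure (an assumed hash family).

    The structure is drawn once; only the query/neighbor pairs vary afterwards.
    """
    rng = random.Random(seed)
    # Fixed data structure: l tables, each hashing on k coordinates
    # sampled independently (with replacement) from the d dimensions.
    tables = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]

    hits = 0
    for _ in range(n_pairs):
        # Random query q and a neighbor p at Hamming distance exactly r.
        q = [rng.getrandbits(1) for _ in range(d)]
        p = list(q)
        for i in rng.sample(range(d), r):
            p[i] ^= 1
        # The pair collides if some table gives both points the same key.
        if any(all(q[c] == p[c] for c in coords) for coords in tables):
            hits += 1
    return hits / n_pairs


if __name__ == "__main__":
    d, r, k, l = 256, 32, 8, 4
    expected = amplified_collision_probability(1 - r / d, k, l)
    observed = empirical_recall(d, r, k, l, n_pairs=5000)
    print(f"expected {expected:.3f}, observed recall {observed:.3f}")
```

Under these assumptions the observed recall of the single fixed structure should land near the value given by the formula, which is the phenomenon the abstract sets out to explain. The pairs here are synthetic and uniformly random, so the sketch says nothing about real datasets.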
