Abstract

Locality sensitive hashing (LSH) is a popular algorithm for approximate nearest neighbour (ANN) search. As LSH partitions vector space uniformly and the distribution of vectors is usually non-uniform, it poorly fits real datasets and has limited efficiency. In this paper, we propose a novel data-dependent locality sensitive hashing (DP-LSH) algorithm, which has a two-level structure. In the first level, we train a number of cluster centres, and use the centres to divide the dataset. So the vectors in each cluster have near uniform distribution. In the second level, we construct LSH tables for each cluster. Given a query, we first determine a few clusters that it belongs to, and perform ANN search in the corresponding LSH tables. Furthermore, we present an optimised distributed scheme and a distributed DP-LSH algorithm. Experimental results on the reference datasets show that the search speed of DP-LSH can be increased by 48 times compared to E2LSH, while keeping high search precision; and the distributed DP-LSH can further improve search efficiency.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call