Bucket-size balancing locality sensitive hashing using the map reduce paradigm

Kyung Mi Lee,Keon Myung Lee,Yoon-Su Jeong,Sang Ho Lee

doi:10.1007/s10586-017-1013-2

Abstract

Similarity search is an essential operation in such domains as data mining and content-based information retrieval. This simple operation causes considerable burden when the number of data records grows large, especially in big data applications. At the sacrifice of accuracy, approximate methods for finding similar ones have been developed to deliver effective services in a reasonable amount of time. Locality sensitive hashing is a class of efficient approximate similarity search techniques. Various algorithms have been proposed for locality sensitive hashing, which basically try to narrow down the candidate data set to be examined. The candidate data set does not always contain all the similar data to query and thus the search results are approximate. The increase in the size of a candidate set improves the recall of similar ones, but it deteriorates the processing speed. This paper is concerned with a method to increase the recall rate while not entailing significant cost. The method basically uses a random hyperplane partitioning technique to create buckets to which data objects are distributed. The nearest neighbors located on the other side of such hyperplanes can be false negatives when only the bucket to which query belongs is examined for finding similar neighbors. The proposed method extends the hyperplanes to occupy their vicinity so that the data objects in the vicinity of a hyperplane are treated as belonging to both sides of the hyperplane simultaneously. The over-sized buckets are further split by adding additional hyperplanes to control the bucket sizes. To improve the processing speed, the algorithm is realized in MapReduce paradigm on a Hadoop cluster. Some experiment results are presented to show its applicability.

Full Text