Abstract

Given two datasets of points (called Query and Training), the Group K Nearest Neighbor (GNN) query retrieves the K points of the Training dataset with the smallest sum of distances to every point of the Query dataset. This spatial query has been studied in recent years, and several performance-improving techniques and pruning heuristics have been proposed. However, this is the first time a parallel and distributed algorithm based on the MapReduce programming framework has been applied to it. In this work, we present a multi-phase algorithm, consisting of alternating local and parallel phases, which can be used to process the GNN query effectively when the Query dataset fits in memory but the Training dataset belongs to the Big Data category. We make use of some of the pruning heuristics and efficient calculation techniques from the literature, as well as different indexing methods, and finally perform comparative benchmarks on several datasets.
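To make the query definition concrete, the following minimal sketch computes a GNN result by brute force. It is an illustration only, not the algorithm presented in this work: it uses no pruning or indexing, assumes Euclidean distance (the abstract does not fix a metric), and the names `group_knn` and `group_knn_partitioned` are hypothetical. The second function imitates the general map/reduce pattern on a single machine, taking a local top-K per partition of the Training dataset and then merging the candidates.

```python
import heapq
import math


def group_knn(query, training, k):
    """Return the k training points with the smallest sum of
    Euclidean distances to all query points (brute force)."""
    def sum_dist(p):
        # Assumption: Euclidean distance; the abstract names no metric.
        return sum(math.dist(p, q) for q in query)
    return heapq.nsmallest(k, training, key=sum_dist)


def group_knn_partitioned(query, partitions, k):
    """Single-machine imitation of a map/reduce split (hypothetical helper)."""
    # "Map" step: each partition reports its own local top-k candidates.
    local_results = [group_knn(query, part, k) for part in partitions]
    # "Reduce" step: merge the candidates and keep the global top-k.
    candidates = [p for part in local_results for p in part]
    return group_knn(query, candidates, k)


query = [(0.0, 0.0), (2.0, 0.0)]
training = [(1.0, 0.0), (5.0, 5.0), (1.0, 1.0), (-3.0, 0.0)]

result = group_knn(query, training, 2)
partitioned = group_knn_partitioned(query, [training[:2], training[2:]], 2)
print(result)
print(partitioned)
```

The partitioned variant returns the same points as the direct one because any point in the global top-K is necessarily in the top-K of its own partition, so it is never lost in the local step.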
