Abstract
Given two datasets of points (called Query and Training), the Group K Nearest Neighbor (GNN) query retrieves the K points of the Training dataset with the smallest sum of distances to every point of the Query dataset. This spatial query has been studied in recent years, and several performance-improving techniques and pruning heuristics have been proposed. However, this is the first time that a parallel and distributed algorithm based on the MapReduce programming framework has been applied to it. In this work, we present a multi-phase algorithm, consisting of alternating local and parallel phases, which can be used to process the GNN query effectively when the Query dataset fits in memory but the Training dataset belongs to the Big Data category. We make use of some of the pruning heuristics and efficient calculation techniques from the literature, as well as different indexing methods, and finally perform comparative benchmarks on several datasets.
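To make the query definition concrete, the following minimal Python sketch computes a GNN result by brute force, and then again over partitions of the Training dataset in a map/reduce style. It illustrates the definition only and is not the multi-phase algorithm presented in this work: it uses no pruning heuristics or indexes, it assumes Euclidean distance, and the function names and the `chunk_size` parameter are hypothetical.

```python
import heapq
import math


def sum_of_distances(point, query):
    """Sum of Euclidean distances from one Training point to every Query point."""
    return sum(math.dist(point, q) for q in query)


def gnn_brute_force(query, training, k):
    """Return the k Training points with the smallest sum of distances to the Query points."""
    return heapq.nsmallest(k, training, key=lambda p: sum_of_distances(p, query))


def gnn_partitioned(query, training, k, chunk_size):
    """Same result, computed per partition (map) and then merged (reduce).

    Assumption: the Query dataset is small enough to be shared with every partition.
    """
    candidates = []
    for start in range(0, len(training), chunk_size):
        chunk = training[start:start + chunk_size]
        # "Map" step: each partition keeps only its own k best (score, point) pairs.
        scored = [(sum_of_distances(p, query), p) for p in chunk]
        candidates.extend(heapq.nsmallest(k, scored, key=lambda item: item[0]))
    # "Reduce" step: the global k best are among the union of the local k best.
    best = heapq.nsmallest(k, candidates, key=lambda item: item[0])
    return [point for _, point in best]
```

The partitioned variant relies on the observation that every point in the global answer is also among the K best of its own partition, so merging the local candidate lists is sufficient.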