Abstract

Given two datasets of points (called Query and Training), the Group K Nearest-Neighbor (GKNN) query retrieves the K points of the Training dataset with the smallest sum of distances to every point of the Query dataset. This spatial query has been studied in recent years, and several performance-improving techniques and pruning heuristics have been proposed. In a previous work, we presented the first MapReduce algorithm, consisting of alternating local and parallel phases, which can be used to effectively process the GKNN query when the Query dataset fits in memory, while the Training dataset belongs to the Big Data category. In subsequent works, we presented several improvements on the first version of the algorithm. In this paper we present a further improvement, which consists of prepartitioning the Training dataset. As shown in the experimentation section, this technique significantly reduces the data transfer and the total running time of the algorithm. Furthermore, the prepartitioning of the Training dataset is performed only once and can be reused with multiple Query datasets, leading to faster response times.
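To make the query definition concrete, the following is a minimal brute-force sketch in Python. It is not the MapReduce algorithm of this paper and applies no pruning or partitioning; it only illustrates what a GKNN query returns. The function name `gknn_brute_force`, the use of Euclidean distance, and the sample points are assumptions made for illustration.

```python
import heapq
import math


def gknn_brute_force(query, training, k):
    """Return the k Training points with the smallest sum of
    (assumed Euclidean) distances to all Query points."""

    def sum_of_distances(p):
        # Sum of distances from a Training point p to every Query point.
        return sum(math.dist(p, q) for q in query)

    # Keep the k Training points with the smallest sum of distances.
    return heapq.nsmallest(k, training, key=sum_of_distances)


# Hypothetical sample data: a small Query set and a small Training set.
query = [(0.0, 0.0), (2.0, 0.0)]
training = [(1.0, 0.0), (1.0, 1.0), (5.0, 5.0), (-3.0, 0.0)]

result = gknn_brute_force(query, training, 2)
print(result)
```

This sketch evaluates every Training point against every Query point, which is exactly the cost that becomes impractical once the Training dataset no longer fits on a single machine.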

