Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark

Panagiotis Moutafis, George Mavrommatis, Michael Vassilakopoulos, Antonio Corral

doi:10.3390/ijgi10110763


Open Access

https://doi.org/10.3390/ijgi10110763


Abstract

The design and implementation of new distributed spatial query algorithms is a current challenge for spatial query processing in distributed computing systems. Apache Spark is a memory-based framework suitable for real-time and batch processing. Spark-based systems allow users to work on distributed in-memory data without worrying about the data distribution mechanism or fault tolerance. Given two datasets of points (called Query and Training), the group K nearest-neighbor (GKNN) query retrieves the K points of the Training dataset with the smallest sum of distances to every point of the Query dataset. This spatial query has been actively studied in centralized environments, where several performance-improving techniques and pruning heuristics have been proposed, and a distributed algorithm in Apache Hadoop was recently proposed by our team. Since, in general, Apache Hadoop exhibits lower performance than Spark, in this paper we present the first distributed GKNN query algorithm in Apache Spark and compare it against the one in Apache Hadoop. This algorithm incorporates programming features and facilities that are specific to Apache Spark, as well as performance-improving techniques that are applicable in Apache Spark. The results of an extensive set of experiments with real-world spatial datasets are presented, demonstrating that our Apache Spark GKNN solution, with its improvements, is efficient and clearly outperforms processing this query in Apache Hadoop.

Highlights

Nowadays, a huge amount of spatial data is generated daily from GPS-enabled devices, such as smart phones, smart watches, cars, and sensors, and from location-tagged posts on Facebook, Instagram, etc.
We present the first Apache Spark-based algorithm for the group K nearest-neighbor (GKNN) query, which is based on the MapReduce algorithm of [13,14], suitably modified to take advantage of Spark-specific features.
Group K Nearest-Neighbor (GKNN) Query. As introduced in [5], the GKNN query retrieves the K points from a set P (Training dataset) that have the smallest sum of distances from all points in another set Q (Query dataset).
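
The GKNN definition above can be illustrated with a minimal brute-force sketch. This is not the paper's algorithm: it scores every Training point against every Query point on a single machine, and it assumes 2D points with Euclidean distance. The function names and the sample coordinates are hypothetical and chosen only for illustration.

```python
import heapq
import math

def sum_of_distances(p, query):
    """Sum of Euclidean distances from one Training point p to every Query point."""
    return sum(math.dist(p, q) for q in query)

def gknn_brute_force(training, query, k):
    """Return the k Training points with the smallest sum of distances to Query."""
    return heapq.nsmallest(k, training, key=lambda p: sum_of_distances(p, query))

# Hypothetical sample data (assumption: 2D points, Euclidean distance).
query = [(0.0, 0.0), (2.0, 0.0)]
training = [(1.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.0, 0.5)]

result = gknn_brute_force(training, query, 2)
print(result)
```

The cost of this sketch grows with the product of the two dataset sizes, which is why pruning heuristics and distributed processing matter for large datasets.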

Summary

Introduction

A huge amount of spatial data is generated daily from GPS-enabled devices, such as smart phones, smart watches, cars, and sensors, and from location-tagged posts on Facebook, Instagram, etc. The term big spatial data is related to the process of capturing, storing, managing, analyzing, and visualizing huge amounts of spatial data without relying on traditional tools and systems. How to process such big spatial data efficiently has become one of the current research hotspots. Hadoop MapReduce [1] and Apache Spark [2] are the dominant distributed frameworks for processing and managing big data on a cluster of computers. Apache Spark is a memory-based framework suitable for real-time and batch processing. Following these two distributed frameworks, two types of research prototype systems for large-scale spatial data query processing have emerged. One type consists of Spark-based prototype systems, among which Sedona (formerly GeoSpark [4]) is under active development and is currently used by several companies, because it manages spatial datasets that fit entirely in main memory very efficiently. As introduced in [5], the GKNN query retrieves the K points from a set P (Training dataset) that have the smallest sum of distances from all points in another set Q (Query dataset).
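
To make the distributed setting more concrete, the following is a generic partition-and-merge sketch in plain Python. It is an assumption about how such a query could be split across workers, not the algorithm of the paper: the Training dataset is divided into partitions, each partition computes its local K best candidates against the full Query dataset, and the local results are merged into the global answer. All function names, the round-robin partitioning, and the sample data are hypothetical. In PySpark, the per-partition step might map onto `RDD.mapPartitions` with the Query dataset shared through a broadcast variable, but that mapping is also an assumption.

```python
import heapq
import math

def sum_dist(p, query):
    """Sum of Euclidean distances from point p to every Query point."""
    return sum(math.dist(p, q) for q in query)

def local_top_k(partition, query, k):
    """Per-partition step: the k best (score, point) pairs within one partition."""
    scored = ((sum_dist(p, query), p) for p in partition)
    return heapq.nsmallest(k, scored)

def merge_top_k(local_results, k):
    """Merge step: combine the local candidates and keep the k best overall."""
    merged = [pair for part in local_results for pair in part]
    return heapq.nsmallest(k, merged)

def gknn_partitioned(training, query, k, num_partitions):
    # Hypothetical round-robin partitioning; a real system would typically
    # partition the data spatially.
    partitions = [training[i::num_partitions] for i in range(num_partitions)]
    local = [local_top_k(part, query, k) for part in partitions]
    return [point for _, point in merge_top_k(local, k)]

# Hypothetical sample data (assumption: 2D points, Euclidean distance).
query = [(0.0, 0.0), (2.0, 0.0)]
training = [(1.0, 0.0), (1.0, 1.0), (5.0, 5.0),
            (0.0, 0.5), (3.0, 0.0), (1.0, -0.5)]

result = gknn_partitioned(training, query, k=3, num_partitions=2)
print(result)
```

The merge is correct because every point in the global top K must also be in the top K of its own partition, so keeping K candidates per partition loses nothing.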



Journal: ISPRS International Journal of Geo-Information
Publication Date: Nov 11, 2021
Citations: 4
License type: CC BY 4.0


