Abstract
Designing and implementing new distributed spatial query algorithms is a current challenge in spatial query processing for distributed computing systems. Apache Spark is a memory-based framework suitable for real-time and batch processing. Spark-based systems allow users to work on distributed in-memory data without worrying about the data distribution mechanism and fault tolerance. Given two datasets of points (called Query and Training), the group K nearest-neighbor (GKNN) query retrieves the K points of the Training dataset with the smallest sum of distances to every point of the Query dataset. This spatial query has been actively studied in centralized environments, where several performance-improving techniques and pruning heuristics have been proposed, and a distributed algorithm in Apache Hadoop was recently proposed by our team. Since Apache Hadoop in general exhibits lower performance than Spark, in this paper we present the first distributed GKNN query algorithm in Apache Spark and compare it against the one in Apache Hadoop. This algorithm incorporates programming features and facilities that are specific to Apache Spark, together with performance-improving techniques that are applicable in Apache Spark. The results of an extensive set of experiments with real-world spatial datasets are presented, demonstrating that our Apache Spark GKNN solution, with its improvements, is efficient and clearly outperforms processing this query in Apache Hadoop.
Highlights
Nowadays, a huge amount of spatial data is generated daily from GPS-enabled devices, such as smartphones, smart watches, cars, and sensors, as well as from location-tagged posts on Facebook, Instagram, etc.
We present the first Apache Spark based algorithm for the group K nearest-neighbor (GKNN) query, which is based on the MapReduce algorithm of [13,14], suitably modified to take advantage of Spark specific features
Group K Nearest-Neighbor (GKNN) Query. As introduced in [5], the GKNN query retrieves the K points from a set P (Training dataset) that have the smallest sum of distances from all points in another set Q (Query dataset)
Summary
A huge amount of spatial data is generated daily from GPS-enabled devices, such as smartphones, smart watches, cars, and sensors, as well as from location-tagged posts on Facebook, Instagram, etc. The term big spatial data refers to the process of capturing, storing, managing, analyzing, and visualizing huge amounts of spatial data without relying on traditional tools and systems. How to process such big spatial data efficiently has become one of the current research hotspots. Hadoop MapReduce [1] and Apache Spark [2] are the dominant distributed frameworks for processing and managing big data on a cluster of computers. Apache Spark is a memory-based framework suitable for real-time and batch processing. Following these two distributed frameworks, two types of research prototype systems for large-scale spatial data query processing have emerged. One type is Hadoop-based prototype systems. The other type is Spark-based prototype systems, where Sedona (formerly GeoSpark [4]) is actively under development and is currently used by several companies, because it manages spatial datasets that fit entirely into main memory very efficiently. As introduced in [5], the GKNN query retrieves the K points from a set P (Training dataset) that have the smallest sum of distances from all points in another set Q (Query dataset).
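To make the GKNN definition concrete, the following is a minimal, centralized brute-force sketch in Python. It is not the distributed Spark algorithm presented in the paper and includes none of its pruning heuristics. It only illustrates what the query returns. The function names (`sum_of_distances`, `gknn_brute_force`), the use of 2D points, and the Euclidean distance are assumptions made for illustration.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def sum_of_distances(p: Point, query: Sequence[Point]) -> float:
    """Sum of Euclidean distances from p to every point of the Query dataset."""
    return sum(math.dist(p, q) for q in query)


def gknn_brute_force(
    training: Sequence[Point], query: Sequence[Point], k: int
) -> List[Tuple[float, Point]]:
    """Return the k Training points with the smallest sum of distances to Query.

    Each result is a (sum_of_distances, point) pair, ordered by ascending sum.
    Every Training point is compared against every Query point, so the cost
    is O(|training| * |query|); this is the naive baseline, not an optimized method.
    """
    scored = ((sum_of_distances(p, query), p) for p in training)
    return heapq.nsmallest(k, scored)


# Hypothetical toy data, chosen only to exercise the function.
training_points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (10.0, 10.0)]
query_points = [(1.0, 0.0), (1.0, 2.0)]
for total, point in gknn_brute_force(training_points, query_points, k=2):
    print(point, round(total, 3))
```

Because every Training point is scored against the whole Query dataset, this baseline scales poorly, which is the cost that pruning heuristics and distributed processing aim to reduce.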