Abstract

Sustainability research faces many challenges because environmental, urban and regional contexts are changing rapidly at an unprecedented level of spatial granularity, which produces massive and growing volumes of data and requires spatial relationships to be detected at a faster pace. Spatial join is a fundamental method for making data more informative with respect to spatial relations. The dramatic growth of data volumes has led to increased focus on high-performance large-scale spatial join. In this paper, we present Spatial Join with Spark (SJS), a high-performance algorithm that uses a simple but efficient uniform spatial grid to partition datasets and joins the partitions with the built-in join transformation of Spark. SJS utilizes the distributed in-memory iterative computation of Spark and introduces a calculation-evaluating model and in-memory spatial repartition technology, which optimize the initial partition by evaluating the calculation amount of local join algorithms without any disk access. We compare four in-memory spatial join algorithms in SJS for further performance improvement. Based on extensive experiments with real-world data, we conclude that SJS outperforms the Spark and MapReduce implementations of earlier spatial join approaches. This study demonstrates that it is promising to leverage high-performance computing for large-scale spatial join analysis. The availability of large geo-referenced datasets, together with high-performance computing technology, creates great opportunities for sustainability research to examine whether and how these new trends in data and technology can help detect the associated trends and patterns in human-environment dynamics.
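The partition-then-join idea described in the abstract can be illustrated with a minimal, single-machine sketch. It does not use Spark and is not the SJS implementation: the rectangle layout `(id, xmin, ymin, xmax, ymax)`, the function names and the `cell_size` parameter are all assumptions made for illustration. The grouping step stands in for a key-based join of grid cells, and a plain nested loop stands in for the local join.

```python
from collections import defaultdict
from itertools import product

# Assumed layout: each rectangle is (id, xmin, ymin, xmax, ymax).

def intersects(a, b):
    """True if two bounding rectangles overlap (touching edges count)."""
    return a[1] <= b[3] and b[1] <= a[3] and a[2] <= b[4] and b[2] <= a[4]

def grid_cells(rect, cell_size):
    """Yield every (col, row) cell of a uniform grid that the rectangle overlaps."""
    _, xmin, ymin, xmax, ymax = rect
    for col in range(int(xmin // cell_size), int(xmax // cell_size) + 1):
        for row in range(int(ymin // cell_size), int(ymax // cell_size) + 1):
            yield (col, row)

def partition(rects, cell_size):
    """Group rectangles by grid cell; a rectangle may land in several cells."""
    cells = defaultdict(list)
    for rect in rects:
        for cell in grid_cells(rect, cell_size):
            cells[cell].append(rect)
    return cells

def grid_spatial_join(left, right, cell_size):
    """Match cells on their shared key, then run a nested-loop join per cell."""
    left_cells = partition(left, cell_size)
    right_cells = partition(right, cell_size)
    results = set()  # a set drops pairs found in more than one cell
    for cell in left_cells.keys() & right_cells.keys():
        for a, b in product(left_cells[cell], right_cells[cell]):
            if intersects(a, b):
                results.add((a[0], b[0]))
    return results
```

Because a rectangle that spans several cells is replicated into each of them, the same pair can be found more than once; collecting results in a set is one simple way to remove those duplicates.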

Highlights

  • Sustainability research faces many challenges because environmental, urban and regional contexts are changing rapidly at an unprecedented level of spatial granularity, which produces massive and growing volumes of data and requires spatial relationships to be detected at a faster pace

  • Based on the analysis of existing Hadoop-like high-performance spatial join algorithms, we found that the key factors for improving the performance of spatial join are: (1) simplification of the spatial partitioning algorithm to reduce the preprocessing time; (2) optimization of the partition results for both CPU and memory requirements; and (3) improvement of the performance of the local join algorithm

  • We propose an improved in-memory spatial repartition method based on the calculation amount of local join algorithms in order to refine the partition results
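A cost-driven repartition step of the kind mentioned in the highlights might look like the sketch below. The calculation-evaluating model itself is not given in this excerpt, so the cost function here is an assumption: it takes the cost of a partition to be the number of pairs a nested-loop local join would compare. The quadrant split, `max_cost` and `max_depth` are likewise hypothetical choices, and rectangles are assumed to be `(id, xmin, ymin, xmax, ymax)` tuples.

```python
def overlaps(rect, box):
    """True if rectangle (id, xmin, ymin, xmax, ymax) overlaps box (x0, y0, x1, y1)."""
    _, xmin, ymin, xmax, ymax = rect
    x0, y0, x1, y1 = box
    return xmin <= x1 and x0 <= xmax and ymin <= y1 and y0 <= ymax

def estimated_cost(left, right):
    """Assumed cost model: a nested-loop local join compares every pair."""
    return len(left) * len(right)

def repartition(box, left, right, max_cost, max_depth=8):
    """Split a partition into quadrants until its estimated cost is small enough."""
    if estimated_cost(left, right) <= max_cost or max_depth == 0:
        return [(box, left, right)]
    x0, y0, x1, y1 = box
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    quadrants = [(x0, y0, xm, ym), (xm, y0, x1, ym),
                 (x0, ym, xm, y1), (xm, ym, x1, y1)]
    parts = []
    for quad in quadrants:
        sub_left = [r for r in left if overlaps(r, quad)]
        sub_right = [r for r in right if overlaps(r, quad)]
        # A quadrant with an empty side cannot produce join results, so skip it.
        if sub_left and sub_right:
            parts.extend(repartition(quad, sub_left, sub_right,
                                     max_cost, max_depth - 1))
    return parts
```

Everything here operates on in-memory lists, which mirrors the stated goal of refining partitions without disk access. The depth limit guards against endless splitting when many rectangles overlap the same point.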


Summary

Introduction

Sustainability research faces many challenges because environmental, urban and regional contexts are changing rapidly at an unprecedented level of spatial granularity, which produces massive and growing volumes of data and requires spatial relationships to be detected at a faster pace. With the emergence of cloud computing, many studies use open source big data computing frameworks, such as Hadoop MapReduce [12] and Apache Spark [13], to improve spatial join efficiency. These Hadoop-like spatial join algorithms include SJMR (Spatial Join with MapReduce) [14], DJ (Distributed Join) in SpatialHadoop [15] and Hadoop-GIS [16]. Utilizing the iterative computation of Spark, we propose a calculation-evaluating model and in-memory spatial repartition technology, which refine the initial partition results of both datasets to limit the processing time of each partition by estimating the time complexity of the local spatial join algorithm. (3) We make performance comparisons among four common in-memory spatial join algorithms in Spark and conclude that the R*-tree index nested-loop join exhibits better performance than other algorithms in a real big data environment.
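To make the notion of a local in-memory join concrete, the sketch below contrasts two textbook algorithms: a plain nested loop and a plane sweep over rectangles sorted by `xmin`. This excerpt does not name the four algorithms the paper compares, and the R*-tree index nested-loop join it reports as fastest is not implemented here. The tuple layout `(id, xmin, ymin, xmax, ymax)` and the function names are assumptions.

```python
def nested_loop_join(left, right):
    """Baseline: test every pair of rectangles for overlap."""
    return [(a[0], b[0]) for a in left for b in right
            if a[1] <= b[3] and b[1] <= a[3] and a[2] <= b[4] and b[2] <= a[4]]

def y_overlap(a, b):
    """True if the y-ranges of two rectangles overlap."""
    return a[2] <= b[4] and b[2] <= a[4]

def scan(anchor, others, start):
    """Return rectangles in others[start:] whose xmin lies within anchor's x-range
    and whose y-range overlaps anchor's. Assumes others is sorted by xmin."""
    found = []
    k = start
    while k < len(others) and others[k][1] <= anchor[3]:
        if y_overlap(anchor, others[k]):
            found.append(others[k])
        k += 1
    return found

def plane_sweep_join(left, right):
    """Sweep both inputs in xmin order; each rectangle is compared only with
    rectangles from the other input that start inside its x-range."""
    left = sorted(left, key=lambda r: r[1])
    right = sorted(right, key=lambda r: r[1])
    results = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i][1] <= right[j][1]:
            results.extend((left[i][0], b[0]) for b in scan(left[i], right, j))
            i += 1
        else:
            results.extend((a[0], right[j][0]) for a in scan(right[j], left, i))
            j += 1
    return results
```

Both functions return the same set of pairs. The sweep skips comparisons between rectangles that are far apart along the x-axis, which is the kind of saving that motivates comparing local join algorithms in the first place.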

Spark Parallel Computing Framework
Spatial Join Query
Hadoop-Like Spatial Join Approaches
Methods
Calculation Evaluating Model
Spatial Repartition Phase in SJS
Experiments and Evaluation
Experiment Setup and Datasets
Findings
Impact of Number of Nodes and Executor Cores
