Abstract

Big data technologies have shown great promise for managing geospatial data in recent years. To handle the growing volume of spatial data, a high-performance spatial data processing system layered on big data technologies is needed. In this paper, we present an approach to processing big spatial data with Apache Spark, a fast and general engine for large-scale data processing. We developed a software development kit named SparkSpatialSDK, which takes the spatial characteristics of geospatial data into consideration and provides a Spark-enabled spatial data structure and API that allow users to easily perform spatial analyses on big spatial data. The spatial data structure couples geometric data structures (point, line, and polygon) with Resilient Distributed Datasets (RDDs). An interface, called SpatialRDD, is provided to access big spatial data stored in distributed database systems such as HBase and to load the data into the Spark processing engine. We illustrate the applications of the API using example processing functions such as spatial range and spatial k-nearest-neighbor queries. The results demonstrate the applicability of SparkSpatialSDK for big geospatial data processing.
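To make the two example operations concrete, the sketch below expresses a spatial range query and a spatial k-nearest-neighbor query over a plain Spark RDD of points in Scala. It is only an illustration of the kind of workload SparkSpatialSDK targets; the Point class, field names, and query logic here are assumptions for this sketch and are not the SDK's actual SpatialRDD API, which would additionally load the data from a distributed store such as HBase.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative point type; the SDK's actual geometry classes are not shown in this abstract.
case class Point(id: Long, x: Double, y: Double)

object SpatialQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("spatial-query-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy in-memory data set; a SpatialRDD would instead be loaded from a store such as HBase.
    val points = sc.parallelize(Seq(
      Point(1, 1.0, 1.0), Point(2, 3.5, 2.0), Point(3, 0.5, 4.0), Point(4, 2.2, 2.8)
    ))

    // Spatial range query: keep the points that fall inside a bounding box.
    val (xmin, ymin, xmax, ymax) = (0.0, 0.0, 3.0, 3.0)
    val inRange = points.filter(p =>
      p.x >= xmin && p.x <= xmax && p.y >= ymin && p.y <= ymax)

    // Spatial k-nearest-neighbor query: the k points closest to a query location (qx, qy).
    val (qx, qy, k) = (2.0, 2.0, 2)
    val knn = points
      .map(p => (math.hypot(p.x - qx, p.y - qy), p))
      .takeOrdered(k)(Ordering.by(_._1))

    inRange.collect().foreach(println)
    knn.foreach { case (d, p) => println(s"$p at distance $d") }

    spark.stop()
  }
}
```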
