Abstract

In this paper, a novel framework for spatial data processing is proposed and applied to the National Geographic Conditions Monitoring project of China. It comprises four layers: spatial data storage, spatial RDDs, spatial operations, and a spatial query language. The spatial data storage layer uses HDFS to store large volumes of spatial vector and raster data across the distributed cluster. The spatial RDDs are abstract logical datasets of spatial data types that can be distributed to the Spark cluster for Spark transformations and actions. The spatial operations layer provides a set of operations on spatial RDDs, such as range query, k-nearest-neighbor query, and spatial join. The spatial query language is a user-friendly interface that gives users unfamiliar with Spark a convenient way to invoke the spatial operations. Compared with other spatial frameworks, this one is distinguished by the comprehensive set of technologies it brings together for big spatial data processing. Extensive experiments on real datasets show that the framework achieves better performance than traditional processing methods.
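
Since the abstract does not expose the framework's API, the following Scala sketch only illustrates the layered idea with plain Apache Spark: records persisted on HDFS are loaded into an RDD of a simple point type, and a range query is expressed as an ordinary transformation followed by an action. The Point and Envelope classes, the CSV layout, and the HDFS path are assumptions for illustration, not the framework's actual interfaces.

    import org.apache.spark.sql.SparkSession

    // Illustrative only: Point, Envelope, the CSV layout and the HDFS path are
    // assumptions, not the classes of the proposed framework.
    case class Point(id: Long, x: Double, y: Double)
    case class Envelope(minX: Double, minY: Double, maxX: Double, maxY: Double) {
      def contains(p: Point): Boolean =
        p.x >= minX && p.x <= maxX && p.y >= minY && p.y <= maxY
    }

    object RangeQuerySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("spatial-range-query").getOrCreate()
        val sc = spark.sparkContext

        // Storage layer: spatial records persisted on HDFS (hypothetical path).
        val points = sc.textFile("hdfs:///data/ngcm/points.csv")
          .map(_.split(","))
          .map(a => Point(a(0).toLong, a(1).toDouble, a(2).toDouble))

        // Spatial operations layer: a range query as an ordinary Spark transformation;
        // the count() action triggers distributed execution on the cluster.
        val window = Envelope(114.0, 30.0, 115.0, 31.0)
        val hits = points.filter(p => window.contains(p)).count()

        println(s"points inside the query window: $hits")
        spark.stop()
      }
    }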

Highlights

  • The storage layer supports persisting spatial data on either the local disk or the Hadoop Distributed File System (HDFS); HDFS is recommended for cluster environments

  • Each block is represented by the minimum bounding rectangle (MBR) of its records, and all the partition blocks are combined into a global R-tree index using their MBRs as index keys through a bulk-loading process (see the sketch after this list)

  • This paper proposes a new Apache Spark-based framework for spatial data processing, which includes four layers: spatial data storage, spatial Resilient Distributed Datasets (RDDs), spatial operations, and a spatial query language
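
As a rough sketch of the block-MBR idea in the second highlight, the snippet below computes one MBR per RDD partition (block) with plain Spark and collects the (block id, MBR) pairs that a bulk-loading routine would pack into a global R-tree. The R-tree construction itself is omitted, and the MBR class and coordinate-pair RDD are illustrative assumptions rather than the paper's data structures.

    import org.apache.spark.rdd.RDD

    // Hypothetical MBR type; the framework's real index classes are not shown in the paper.
    case class MBR(minX: Double, minY: Double, maxX: Double, maxY: Double) {
      def expand(x: Double, y: Double): MBR =
        MBR(math.min(minX, x), math.min(minY, y), math.max(maxX, x), math.max(maxY, y))
    }

    // One MBR per partition (block); the result is small enough to collect on the driver,
    // where a bulk-loading routine (e.g. STR packing) would build the global R-tree.
    def blockMBRs(coords: RDD[(Double, Double)]): Array[(Int, MBR)] =
      coords.mapPartitionsWithIndex { (blockId, it) =>
        if (it.isEmpty) Iterator.empty
        else {
          val (x0, y0) = it.next()
          val mbr = it.foldLeft(MBR(x0, y0, x0, y0)) { case (m, (x, y)) => m.expand(x, y) }
          Iterator((blockId, mbr))
        }
      }.collect()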

Summary

INTRODUCTION

It is implemented on top of Apache Spark and deeply leverages modern database techniques such as efficient data layout, code generation, and query optimization in order to optimize geospatial queries. It supports the full suite of OpenGIS Simple Features for SQL spatial predicate functions and operators, together with additional topological functions. Another software development kit for processing big spatial data with Apache Spark is SparkSpatialSDK (Shangguan, Yue, and Wu, 2017). A novel Apache Spark-based computing framework for spatial data is introduced; it leverages Spark as the underlying layer to achieve better computing performance than Hadoop. What distinguishes it from other Hadoop- and Spark-based spatial computing frameworks is the close integration, both logical and physical, between Hadoop HDFS and Spark spatial RDDs.
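
To make the Simple Features for SQL support concrete, the query below shows the kind of statement such frameworks accept. ST_Contains and ST_Area are standard OGC SF-SQL function names, but the table and column names are hypothetical and the way these functions are registered with a Spark SQL session is framework-specific, so the query is given only as a SQL string.

    // SF-SQL predicate and measure functions in a plain SQL string; the buildings/counties
    // tables and geom columns are hypothetical, and function registration is framework-specific.
    val spatialJoinSql =
      """
        |SELECT c.county_id, COUNT(*) AS building_count
        |FROM buildings b, counties c
        |WHERE ST_Contains(c.geom, b.geom)      -- SF-SQL topological predicate
        |  AND ST_Area(b.geom) > 100.0          -- SF-SQL measure function
        |GROUP BY c.county_id
      """.stripMargin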

DETAILS
Spatial Spark SQL Language
Storage Layer
Build index
Index File Structure
Spatial RDDs Layer
Spatial Operations Layer
EXPERIMENTS
PERFORMANCE COMPARISON
Findings
CONCLUSION