Parallel Processing Strategies for Big Geospatial Data

Martin Werner

doi:10.3389/fdata.2019.00044

Abstract

This paper provides an abstract analysis of parallel processing strategies for spatial and spatio-temporal data. It isolates data locality, computational locality, redundancy, and locally sequential access as central elements of parallel algorithm design for spatial data. Furthermore, the paper gives examples from simple and advanced GIS and spatial data analysis. These examples highlight that big data systems have been around long before the current hype of big data, and that they follow design principles which are inevitable for spatial data, including distributed data structures and messaging, which are, however, incompatible with the popular MapReduce paradigm. From this discussion, the need for a replacement or extension of the MapReduce paradigm for spatial data is derived. Such a paradigm should be able to deal with the imperfect data locality inherent to spatial data, which hinders full independence of non-trivial computational tasks. We conclude that more research is needed and that spatial big data systems should pick up more concepts such as graphs, shortest paths, raster data, events, and streams at the same time, instead of solving the same set of spatially separable problems, such as line simplifications or range queries, in many different ways.

Highlights

In the last decade, the term Big Data has been silently identified with web-scale cloud computing systems for handling big data.
We will discuss a certain set of spatial algorithm classes and how they fit into the diverse categories of big data computing systems and frameworks.
While many systems follow the data distribution (e.g., Kini and Emanuele, 2014; Whitman et al., 2014; Eldawy and Mokbel, 2015; Xie et al., 2016), it has not yet been widely discussed how to follow the query distribution or how to adapt to the query workload during execution. This is an interesting direction for spatial big data research: How can we exploit the joint distribution of queries and data when distributing data across the cluster, in order to resolve the tradeoff between query locality and the number of nodes that could contribute to a query execution?
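
The tradeoff named in the last highlight can be made concrete with a small sketch. The following Python code is not taken from the paper; it assumes a uniform grid over the unit square and a round-robin assignment of grid cells to nodes, and all function names (cell_of, node_of, partition, nodes_for_query) are hypothetical. It counts how many nodes own a cell overlapping a rectangular range query, so that a coarse grid (high query locality) can be compared with a fine grid (more nodes contributing).

```python
from collections import defaultdict


def cell_of(x, y, cells_per_axis):
    """Map a point of the unit square to its (column, row) grid cell."""
    cx = min(int(x * cells_per_axis), cells_per_axis - 1)
    cy = min(int(y * cells_per_axis), cells_per_axis - 1)
    return cx, cy


def node_of(cell, cells_per_axis, num_nodes):
    """Assumed placement rule: round-robin over the row-major cell index."""
    cx, cy = cell
    return (cy * cells_per_axis + cx) % num_nodes


def partition(points, cells_per_axis, num_nodes):
    """Distribute points to nodes according to the cell they fall into."""
    nodes = defaultdict(list)
    for x, y in points:
        cell = cell_of(x, y, cells_per_axis)
        nodes[node_of(cell, cells_per_axis, num_nodes)].append((x, y))
    return nodes


def nodes_for_query(xmin, ymin, xmax, ymax, cells_per_axis, num_nodes):
    """Return the set of nodes owning at least one cell that overlaps the box."""
    cx0, cy0 = cell_of(xmin, ymin, cells_per_axis)
    cx1, cy1 = cell_of(xmax, ymax, cells_per_axis)
    return {
        node_of((cx, cy), cells_per_axis, num_nodes)
        for cx in range(cx0, cx1 + 1)
        for cy in range(cy0, cy1 + 1)
    }


if __name__ == "__main__":
    query = (0.1, 0.1, 0.2, 0.2)
    for cells in (4, 16):
        touched = nodes_for_query(*query, cells_per_axis=cells, num_nodes=4)
        print(cells, "cells per axis:", len(touched), "node(s) touched")
```

Under these assumptions, the same small query stays on a single node with the coarse grid and is spread over several nodes with the fine grid. A placement that follows the joint distribution of queries and data would have to choose between these extremes per region, which this fixed grid does not attempt.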

Summary

Introduction

The term Big Data has been silently identified with web-scale cloud computing systems for handling big data. We will discuss a certain set of spatial algorithm classes and how they fit into the diverse categories of big data computing systems and frameworks.

Results

Conclusion