Enabling Efficient Distributed Spatial Join on Large Scale Vector-Raster Data Lakes

Sebastian Villarroya, Jose R R Viqueira, Jose M Cotos, Jose A Taboada

doi:10.1109/access.2022.3157405

Abstract

Both the increasing number of GPS-enabled mobile devices and geographic crowd-sourcing initiatives, such as Open Street Map, are key drivers of the large amount of vector spatial data that is currently being produced. On the other hand, the automatic generation of raster data by remote sensing devices and environmental modeling processes has always led to very large datasets. Currently, huge data generation rates are reached by improved sensor observation systems and data processing infrastructures. As an example, the Sentinel Data Access System of the Copernicus Program of the European Space Agency (ESA) published 38.71 TB of data per day during 2020. This paper shows how adopting a new spatial data model that includes multi-resolution parametric spatial data types enables an efficient implementation of a large scale distributed spatial analysis system for integrated vector-raster data lakes. In particular, the proposed implementation outperforms state-of-the-art Spark-based spatial analysis systems by more than one order of magnitude during vector-raster spatial join evaluation.

Highlights

Two major types of spatial datasets exist, namely vector and raster datasets
Much research effort has been devoted to vector spatial data management, which has led to mature and standardized spatial DBMS solutions [1], [2]
Although data storage may still be efficient due to the data compression facilities incorporated in current distributed columnar data storage formats like Apache Parquet [17], data processing may not leverage the sampling nature of raster data to devise more efficient algorithms for spatial operations. It is shown how assuming an already existing integrated vector-raster data model enables the efficient implementation of a large scale vector-raster spatial on-line data analysis system on top of Apache Spark

Summary

INTRODUCTION

Two major types of spatial datasets exist, namely vector and raster datasets. Vector datasets contain data of spatial entities, including the vector geometries that represent their location and shape in space. Although data storage may still be efficient due to the data compression facilities incorporated in current distributed columnar data storage formats like Apache Parquet [17], data processing may not leverage the sampling nature of raster data to devise more efficient algorithms for spatial operations. In this paper, it is shown how assuming an already existing integrated vector-raster data model enables the efficient implementation of a large scale vector-raster spatial on-line data analysis system on top of Apache Spark. Exhaustive experimentation shows how specific optimizations achieve response times for vector-raster spatial joins that are more than an order of magnitude faster than those of currently available Spark-based spatial analysis systems.
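To illustrate what "leveraging the sampling nature of raster data" can mean, the following minimal sketch contrasts two ways of finding the raster cells that overlap a vector bounding box. It is not the implementation described in the paper. It assumes a regular grid with square cells, half-open cell extents, and a lower-left origin, and all names (`cells_by_arithmetic`, `cells_by_scan`, `bbox`, `origin`, `cell_size`, `shape`) are hypothetical. The first function tests every cell as if each were an independent geometry; the second derives the matching row and column ranges directly from the grid parameters.

```python
import math
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Cell = Tuple[int, int]                    # (row, col)


def cells_by_scan(bbox: BBox, origin: Tuple[float, float],
                  cell_size: float, shape: Tuple[int, int]) -> List[Cell]:
    """Test every cell against the box, ignoring the grid regularity."""
    xmin, ymin, xmax, ymax = bbox
    ox, oy = origin
    n_rows, n_cols = shape
    hits = []
    for r in range(n_rows):
        for c in range(n_cols):
            # Half-open cell extent: [x0, x1) x [y0, y1)
            x0, x1 = ox + c * cell_size, ox + (c + 1) * cell_size
            y0, y1 = oy + r * cell_size, oy + (r + 1) * cell_size
            if x0 <= xmax and x1 > xmin and y0 <= ymax and y1 > ymin:
                hits.append((r, c))
    return hits


def cells_by_arithmetic(bbox: BBox, origin: Tuple[float, float],
                        cell_size: float, shape: Tuple[int, int]) -> List[Cell]:
    """Compute the overlapping index ranges directly from the grid geometry."""
    xmin, ymin, xmax, ymax = bbox
    ox, oy = origin
    n_rows, n_cols = shape
    # Index of the cell containing each box edge, clamped to the grid.
    col_lo = max(0, math.floor((xmin - ox) / cell_size))
    col_hi = min(n_cols - 1, math.floor((xmax - ox) / cell_size))
    row_lo = max(0, math.floor((ymin - oy) / cell_size))
    row_hi = min(n_rows - 1, math.floor((ymax - oy) / cell_size))
    return [(r, c)
            for r in range(row_lo, row_hi + 1)
            for c in range(col_lo, col_hi + 1)]


if __name__ == "__main__":
    box = (1.5, 0.5, 3.2, 1.1)
    print(cells_by_arithmetic(box, (0.0, 0.0), 1.0, (4, 5)))
```

The scan visits every cell of the grid, while the arithmetic version only touches the cells it returns, so its cost depends on the size of the query box rather than on the size of the raster.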

RELATED WORK

DATA TYPES

DATA STRUCTURES

OPERATIONS

DISTRIBUTED IMPLEMENTATION

CONCLUSIONS