Abstract

In the geospatial domain we have now reached the point where the data volumes we handle have clearly grown beyond the capacity of most desktop computers. This is particularly true in the area of point cloud processing. It is therefore naturally attractive to explore established big data frameworks for big geospatial data. The very first hurdle is the import of geospatial data into big data frameworks, commonly referred to as data ingestion. Geospatial data is typically encoded in specialised binary file formats, which are not natively supported by the existing big data frameworks. Instead, such file formats are supported by software libraries that are restricted to single-CPU execution. We present an approach that allows the use of existing point cloud file format libraries on the Apache Spark big data framework. We demonstrate the ingestion of large volumes of point cloud data into a compute cluster. The approach uses a map function to distribute the data ingestion across the nodes of a cluster. We test the capabilities of the proposed method to load billions of points into a commodity hardware compute cluster and we discuss the implications for scalability and performance. The performance is benchmarked against an existing native Apache Spark data import implementation.

Highlights

  • While the origins of the term “Big Data” itself might be complex (Diebold, 2012) and disputed, one of the most commonly accepted definitions of the term was given by Laney (2001)

  • For this work we focus on the first step in the Big Data pipeline, the import of data into the framework, commonly referred to as data ingestion

  • Apache Spark relies heavily on the concept of Resilient Distributed Datasets (RDDs) (Zaharia et al., 2012)

Summary

INTRODUCTION

While the origins of the term “Big Data” itself might be complex (Diebold, 2012) and disputed, one of the most commonly accepted definitions of the term was given by Laney (2001). He observes “data management challenges along three dimensions: volume, velocity and variety”. The phenomenon is not unknown to the geospatial community and big spatial data has been identified as an emerging research trend (Eldawy and Mokbel, 2015a). We will focus on a special area of big spatial data and a particular challenge in data management.

Point Cloud Data Use
Geo Data as Big Data
Cloud Compute Engines
IQmulus Architecture
Spark SQL IQmulus Library
Single CPU libraries
PROPOSED METHOD
Naïve Sideloading
Slicing
Data Sets
Medium Dataset
Large Dataset