Skewness-Based Partitioning in SpatialHadoop

Alberto Belussi, Ahmed Eldawy, Sara Migliorini

doi:10.3390/ijgi9040201

Abstract

In recent years, several extensions of the Hadoop system have been proposed for dealing with spatial data. SpatialHadoop belongs to this group of projects and includes MapReduce implementations of spatial operators such as range queries and spatial joins. The MapReduce paradigm is based on the fundamental principle that a task can be parallelized by partitioning the data into chunks and performing the same operation on each of them (map phase), then combining the partial results at the end (reduce phase). The partitioning technique can therefore greatly affect the performance of a parallel execution, since it is the key to obtaining balanced map tasks and exploiting parallelism as much as possible. For uniformly distributed datasets, this goal is easily achieved by partitioning the geometries of the input dataset with a regular grid covering the whole reference space; for skewed datasets, a regular grid might not be the right choice and other techniques have to be applied. For instance, SpatialHadoop can also produce a global index by means of a Quadtree-based grid or an R-tree-based grid, which are in turn more expensive index structures to build. This paper proposes a technique, based on a box-counting function and a heuristic rooted in theoretical properties and experimental observations, for detecting the degree of skewness of an input spatial dataset and then deciding which partitioning technique to apply in order to improve the performance of subsequent operations as much as possible. Experiments on both synthetic and real datasets are presented to confirm the effectiveness of the proposed approach.
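The abstract does not define the box-counting function it relies on. As a rough illustration only, the Python sketch below uses one common formulation (an assumption, not taken from the text): overlay a grid of square cells of side `cell_side`, sum the `q`-th power of the number of points in each cell, and fit the slope of that sum against the cell side on a log-log scale. The names `box_count`, `box_count_slope`, the choice `q=2`, and the test datasets are all hypothetical. For points spread uniformly over a 2D region the slope should be close to 2, while points concentrated along a line give a slope close to 1. How the paper's heuristic maps such a value onto a choice between the regular grid, the Quadtree-based grid, and the R-tree-based grid is not stated here.

```python
import math
import random
from collections import Counter


def box_count(points, cell_side, q=2):
    """Sum over occupied grid cells of (points in the cell) ** q.

    Assumed formulation: square cells of the given side, aligned to the origin.
    """
    counts = Counter(
        (math.floor(x / cell_side), math.floor(y / cell_side)) for x, y in points
    )
    return sum(c ** q for c in counts.values())


def box_count_slope(points, cell_sides, q=2):
    """Least-squares slope of log(box_count) against log(cell_side)."""
    xs = [math.log(r) for r in cell_sides]
    ys = [math.log(box_count(points, r, q)) for r in cell_sides]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


rng = random.Random(42)
# Hypothetical uniform sample over the unit square.
uniform = [(rng.random(), rng.random()) for _ in range(20000)]
# Hypothetical skewed sample: every point lies on the diagonal of the unit square.
diagonal = [(t, t) for t in (rng.random() for _ in range(20000))]

cell_sides = [1 / 4, 1 / 8, 1 / 16, 1 / 32]
uniform_slope = box_count_slope(uniform, cell_sides)
diagonal_slope = box_count_slope(diagonal, cell_sides)
print(f"uniform: {uniform_slope:.2f}, diagonal: {diagonal_slope:.2f}")
```

A slope well below 2 on a two-dimensional dataset is the kind of signal that could indicate skew under this formulation; any concrete threshold would be a further assumption.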

Highlights

In recent years, several application contexts have come to require the analysis of huge amounts of data, and very frequently the dimensions of interest include spatial properties
As a first example of the kind of issue we want to consider in this paper, we show in Table 1 the results of executing in SpatialHadoop the Distributed Join (DJ) [5], the Range Query (RQ), and the k-Nearest Neighbor (k-NN) operation when applied to different situations
We summarize the main characteristics of the MapReduce implementation of spatial operations such as spatial join, range query, and k-nearest neighbor, together with the main partitioning techniques usually available in cluster systems dedicated to spatial data, such as SpatialHadoop

Summary

Introduction

In recent years, several application contexts have come to require the analysis of huge amounts of data, and very frequently the dimensions of interest include spatial properties. The MapReduce paradigm has been successfully applied to implement parallel solutions for the spatial operations that are typically required for spatial data analysis. We summarize the main characteristics of the MapReduce implementation of spatial operations such as spatial join, range query, and k-nearest neighbor, together with the main partitioning techniques usually available in cluster systems dedicated to spatial data, such as SpatialHadoop. Hadoop traditionally applies a random division of the input data: during split generation, the only prescribed constraint regards the size in bytes of the splits on the HDFS (Hadoop Distributed File System). This naïve partitioning cannot be the right choice for spatial analysis, in which some filtering or pruning is always performed when evaluating spatial predicates.
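The pruning argument above can be illustrated with a minimal Python sketch. It is not SpatialHadoop code: it assumes point data in the unit square, and the helpers `grid_partition` and `range_query` are hypothetical. With a regular grid, a range query only has to scan the cells that intersect the query rectangle, whereas with splits generated by size alone every split could contain matching points and all of them must be read.

```python
import random
from collections import defaultdict


def grid_partition(points, cells_per_side):
    """Assign each point of the unit square to a cell of a regular grid."""
    partitions = defaultdict(list)
    for x, y in points:
        col = min(int(x * cells_per_side), cells_per_side - 1)
        row = min(int(y * cells_per_side), cells_per_side - 1)
        partitions[(col, row)].append((x, y))
    return partitions


def range_query(partitions, cells_per_side, xmin, ymin, xmax, ymax):
    """Scan only the grid cells that intersect the query rectangle.

    Returns the matching points and the number of cells scanned.
    Assumes the query rectangle lies inside the unit square.
    """
    first_col = max(int(xmin * cells_per_side), 0)
    last_col = min(int(xmax * cells_per_side), cells_per_side - 1)
    first_row = max(int(ymin * cells_per_side), 0)
    last_row = min(int(ymax * cells_per_side), cells_per_side - 1)
    hits, scanned = [], 0
    for col in range(first_col, last_col + 1):
        for row in range(first_row, last_row + 1):
            scanned += 1
            for x, y in partitions.get((col, row), []):
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    hits.append((x, y))
    return hits, scanned


rng = random.Random(7)
points = [(rng.random(), rng.random()) for _ in range(10000)]
partitions = grid_partition(points, 8)  # 8 x 8 = 64 cells

query = (0.1, 0.1, 0.3, 0.3)
hits, scanned = range_query(partitions, 8, *query)
# Reference answer obtained by scanning every point, as a random split would require.
brute = [p for p in points if 0.1 <= p[0] <= 0.3 and 0.1 <= p[1] <= 0.3]
print(f"scanned {scanned} of 64 cells, found {len(hits)} points")
```

In this sketch the query touches 9 of the 64 cells. On a skewed dataset the same regular grid would still prune, but a few cells would hold most of the points, which is the imbalance the paper's partitioning choice aims to avoid.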


Journal: ISPRS International Journal of Geo-Information
Publication Date: Mar 27, 2020
Citations: 13
License type: CC BY 4.0


Similar Papers

Detecting skewness of big spatial data in SpatialHadoop
Alberto Belussi ... Ahmed Eldawy
06 Nov 2018

Parallel Processing Strategies for Big Geospatial Data
Martin Werner
Frontiers in Big Data | VOL. 2
03 Dec 2019

Selectivity estimation for spatial joins
Ning An ... A Sivasubramaniam
02 Apr 2001

ASPEN
Haojun Wang ... Roger Zimmermann
04 Nov 2005
