Abstract

Due to the ubiquity of spatial data applications and the large amounts of spatial data that these applications generate and process, there is a pressing need for scalable spatial query processing. In this paper, we present new techniques for spatial query processing and optimization in an in-memory and distributed setup to address scalability. More specifically, we introduce new techniques that handle the query skew that commonly occurs in practice and that minimize communication costs accordingly. We propose a distributed query scheduler that uses a new cost model to minimize the cost of spatial query processing. The scheduler generates query execution plans that minimize the effect of query skew. The query scheduler utilizes new spatial indexing techniques based on bitmap filters to forward queries to the appropriate local nodes. Each local computation node is responsible for optimizing and selecting its best local query execution plan based on the indexes and the nature of the spatial queries in that node. All the proposed spatial query processing and optimization techniques are prototyped inside Spark, a distributed memory-based computation system. Our prototype system is termed LocationSpark. The experimental study is based on real datasets and demonstrates that LocationSpark can enhance distributed spatial query processing by up to an order of magnitude over existing in-memory and distributed spatial systems.

Highlights

  • Spatial computing is becoming increasingly important with the proliferation of mobile devices

  • LocationSpark only searches for data partitions that contribute to the kNN query point based on the global and local spatial indexes and the sFilter

  • We present LocationSpark, a query executor, and an optimizer based on Spark to improve the query execution plan generated for spatial queries

Summary

INTRODUCTION

Spatial computing is becoming increasingly important with the proliferation of mobile devices. MapReduce-based systems allow users to run spatial queries using predefined high-level spatial operators without worrying about fault tolerance or computation distribution. These systems have the following two main limitations: (1) they do not leverage the power of distributed memory, and (2) they are unable to reuse intermediate data (Zaharia, 2016). A kNN join [Figure 1 (right)] returns the k nearest neighbors from the dataset D for each query point q ∈ Q. Both spatial operators are expensive and may incur computation skew in certain workers, greatly degrading the overall performance. Consider a large spatial dataset, with millions of points of interest (POIs), that is partitioned across different computation nodes based on the spatial distribution of the data, e.g., one data partition represents data from San Francisco, CA, and another represents data from Chicago, IL.
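
To make the kNN join definition above concrete, here is a minimal single-machine sketch in Python. It is a brute-force illustration only, not LocationSpark's distributed algorithm: the function name `knn_join`, the use of Euclidean distance, and the sample points are all assumptions introduced for this example.

```python
import heapq
import math


def knn_join(queries, data, k):
    """Brute-force kNN join: map each query point q in Q to its k nearest points in D.

    Illustrative only; assumes 2-D points as tuples and Euclidean distance.
    """
    result = {}
    for q in queries:
        # nsmallest returns the k points of `data` closest to q, nearest first
        result[q] = heapq.nsmallest(k, data, key=lambda p: math.dist(q, p))
    return result


# Hypothetical sample data: two query points and four data points
Q = [(0.0, 0.0), (10.0, 10.0)]
D = [(1.0, 1.0), (2.0, 2.0), (9.0, 9.0), (11.0, 10.0)]

pairs = knn_join(Q, D, k=2)
for q, neighbors in pairs.items():
    print(q, "->", neighbors)
```

Every query point scans all of D here, so the cost grows with |Q| × |D|. If many query points fall into the same data partition, the worker holding that partition does most of this work, which is the kind of computation skew the paragraph above describes.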

Data Model and Spatial Operators
Overview of In-memory Distributed Spatial Query Processing in LocationSpark
Challenges
QUERY PLAN SCHEDULER
The Cost Model
Execution Plan Generation
A Greedy Algorithm
LOCAL EXECUTION
Spatial Range Join
SPATIAL BITMAP FILTER
Binary Encoding of the sFilter
Query Processing Using the sFilter
Query-Aware Adaptivity of the sFilter
PERFORMANCE STUDY
Experimental Setup
Spatial Range Select and Join
Performance of kNN Select and Join
RELATED WORK
CONCLUSIONS
Findings
DATA AVAILABILITY STATEMENT