Abstract

The increase in GPS-enabled devices and proliferation of location-based applications have resulted in an abundance of geotagged (spatial) data. As a consequence, numerous applications have emerged that utilize the spatial data to provide different types of location-based services. However, the huge amount of available spatial data presents a challenge to the efficiency of these location-based services. Although the advent of big data frameworks like Apache Spark has enabled the processing of large amounts of data efficiently, they are designed for general (non-spatial) data. That is due to the build-in data partitioning mechanism that does not take into account the spatial proximity of the data. Therefore, these big data frameworks cannot be readily used for spatial analytics such as efficiently answering spatial queries. To fill this gap, this paper proposes SparkNN, an in-memory partitioning and indexing system for answering spatial queries, such as K-nearest neighbor, on big spatial data. SparkNN is implemented on top of Apache Spark and consists of three layers to facilitate efficient spatial queries. The first layer is a spatial-aware partitioning layer, which partitions the spatial data into several partitions ensuring that the load of the partitions is balanced and data objects with close proximity are placed in the same, or neighboring, partitions. The second layer is a local indexing layer, which provides a spatial index inside each partition to speed up the data search within the partition. The third layer is a global index, which is placed in the master node of Spark to route spatial queries to the relevant partitions. The efficiency of SparkNN was evaluated by extensive experiments with big spatial datasets. The results show SparkNN significantly outperforms the state-of-the-art Spark system when evaluated on the same set of queries.

Highlights

  • The unparalleled popularity of online social media has resulted in the generation of vast amounts of data in various domains such as Banking, Government, Healthcare, Telecommunications, and Stock markets

  • SparkNN consists of three layers, which are built on top of the Apache Spark system, namely, Spatial-aware data partitioning layer, Local indexing layer and Global index layer

  • The paper proposed SparkNN, which is an in-memory partitioning and indexing system to process K-nearest neighbor spatial queries on big spatial data

Read more

Summary

Introduction

The unparalleled popularity of online social media has resulted in the generation of vast amounts of data in various domains such as Banking, Government, Healthcare, Telecommunications, and Stock markets. Al Aghbari et al: SparkNN for iterative data processing, which is required by most queries This motivated the emergence of Apache Spark Zaharia et al (2016), which is an in-memory, real-time cluster processing platform to manage big data and facilitate queries. Spark balances the load between the different computing nodes, it randomly distributes the spatial data. SparkNN consists of three layers, which are built on top of the Apache Spark system, namely, Spatial-aware data partitioning layer, Local indexing layer and Global index layer. Due to the huge number of spatial points in every partition, fast access methods are required to retrieve the relevant data point for a given query These local indexes are used to efficiently answer spatial queries, the KNN queries.

Hadoop-based systems
Spark-based systems
SparkNN
Spatial-Aware Data Partitioning Layer
Global indexing
Local indexing
KNN Querying
Dataset
Impact of k
Impact of data size
Scalability
Impact of SARDDs
Conclusion and Future Directions
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call