SparkNN: A Distributed In-Memory Data Partitioning for KNN Queries on Big Spatial Data

Zaher Al Aghbari, Ibrahim Kamel, Tasneem Ismail

doi:10.5334/dsj-2020-035

Abstract

The increase in GPS-enabled devices and the proliferation of location-based applications have resulted in an abundance of geotagged (spatial) data. As a consequence, numerous applications have emerged that use spatial data to provide different types of location-based services. However, the huge amount of available spatial data presents a challenge to the efficiency of these location-based services. Although big data frameworks like Apache Spark have enabled the efficient processing of large amounts of data, they are designed for general (non-spatial) data. This is because their built-in data partitioning mechanism does not take into account the spatial proximity of the data. Therefore, these big data frameworks cannot be readily used for spatial analytics such as efficiently answering spatial queries. To fill this gap, this paper proposes SparkNN, an in-memory partitioning and indexing system for answering spatial queries, such as K-nearest neighbor (KNN) queries, on big spatial data. SparkNN is implemented on top of Apache Spark and consists of three layers to facilitate efficient spatial queries. The first layer is a spatial-aware partitioning layer, which divides the spatial data into several partitions, ensuring that the load of the partitions is balanced and that data objects in close proximity are placed in the same, or neighboring, partitions. The second layer is a local indexing layer, which provides a spatial index inside each partition to speed up the data search within the partition. The third layer is a global index, which is placed in the master node of Spark to route spatial queries to the relevant partitions. The efficiency of SparkNN was evaluated by extensive experiments with big spatial datasets. The results show that SparkNN significantly outperforms the state-of-the-art Spark system when evaluated on the same set of queries.

Highlights

The unparalleled popularity of online social media has resulted in the generation of vast amounts of data in various domains such as Banking, Government, Healthcare, Telecommunications, and Stock markets
SparkNN consists of three layers, which are built on top of the Apache Spark system, namely, Spatial-aware data partitioning layer, Local indexing layer and Global index layer
The paper proposed SparkNN, which is an in-memory partitioning and indexing system to process K-nearest neighbor spatial queries on big spatial data

Summary

Introduction

The unparalleled popularity of online social media has resulted in the generation of vast amounts of data in various domains such as banking, government, healthcare, telecommunications, and stock markets. Most queries require iterative data processing, which motivated the emergence of Apache Spark (Zaharia et al., 2016), an in-memory, real-time cluster processing platform to manage big data and facilitate queries. Spark balances the load between the different computing nodes, but it distributes the spatial data randomly. SparkNN consists of three layers built on top of the Apache Spark system: the spatial-aware data partitioning layer, the local indexing layer, and the global index layer. Due to the huge number of spatial points in every partition, fast access methods are required to retrieve the relevant data points for a given query. These local indexes are used to efficiently answer spatial queries such as KNN queries.
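To make the three-layer idea concrete, the following is a minimal single-machine sketch in plain Python, not the authors' implementation and not Spark code. It assumes a uniform grid as the spatial-aware partitioner (the names CELL, cell_of, build_partitions, min_dist_to_cell and knn are hypothetical), uses a sorted list of cells as a stand-in for the global index, and replaces the local index with a linear scan inside each partition.

```python
import heapq
import math
from collections import defaultdict

CELL = 10.0  # hypothetical grid cell width; a stand-in for the paper's partitioner


def cell_of(p):
    """Map a point to the grid cell (partition key) that contains it."""
    return (math.floor(p[0] / CELL), math.floor(p[1] / CELL))


def build_partitions(points):
    """Partitioning layer: nearby points end up in the same partition."""
    parts = defaultdict(list)
    for p in points:
        parts[cell_of(p)].append(p)
    return dict(parts)


def min_dist_to_cell(q, cell):
    """Lower bound on the distance from q to any point stored in a cell."""
    x0, y0 = cell[0] * CELL, cell[1] * CELL
    dx = max(x0 - q[0], 0.0, q[0] - (x0 + CELL))
    dy = max(y0 - q[1], 0.0, q[1] - (y0 + CELL))
    return math.hypot(dx, dy)


def knn(parts, q, k):
    """Route the query to partitions nearest-first and stop early."""
    # Global index stand-in: partitions ordered by their lower-bound distance.
    order = sorted(parts, key=lambda c: min_dist_to_cell(q, c))
    best = []  # max-heap via negated distances, holds at most k entries
    for cell in order:
        if len(best) == k and min_dist_to_cell(q, cell) >= -best[0][0]:
            break  # no remaining partition can hold a closer point
        # Local index stand-in: a linear scan of the partition.
        for p in parts[cell]:
            d = math.dist(q, p)
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
    # Return the neighbors ordered from nearest to farthest.
    return [p for _, p in sorted(best, key=lambda e: -e[0])]
```

Because partitions are visited in order of their lower-bound distance, the search can stop as soon as the k-th best candidate is at least as close as the next partition could possibly be. With random partitioning no such bound exists, so every partition would have to be scanned.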

Hadoop-based systems

Spark-based systems

SparkNN

Spatial-Aware Data Partitioning Layer

Global indexing

Local indexing

KNN Querying

Dataset

Impact of k

Impact of data size

Scalability

Impact of SARDDs

Conclusion and Future Directions

Journal: Data Science Journal
Publication Date: Aug 24, 2020
Citations: 3
License type: CC BY 4.0

Similar Papers

Spatial Concept Query Based on Lattice-Tree
Aopeng Xu ... Zhiyuan Zhang
ISPRS International Journal of Geo-Information | VOL. 11
15 May 2022

Big spatial data processing with Apache Spark
Boyi Shangguan ... Liangcun Jiang
01 Aug 2017

Comparative Analysis of Spark and Ignite for Big Spatial Data Processing
Samah Abuayeid ... Louai Alarabi
International Journal of Advanced Computer Science and Applications | VOL. 12
01 Jan 2020

Spatial big data architecture: From Data Warehouses and Data Lakes to the LakeHouse
Soukaina Ait Errami ... Hassan Badir
Journal of Parallel and Distributed Computing | VOL. 176
01 Jun 2023
