Abstract

Privacy preservation and anonymity have attracted significant attention from the big data perspective. We take the view that forthcoming frameworks and theories will establish several solutions for privacy protection. k-anonymity is considered a key solution that has been widely employed to prevent data re-identification, and it is central to this work. Data modeling has also gained significant attention from the big data perspective. It is believed that advancing distributed environments will provide users with several solutions for efficient spatio-temporal data management. GeoSpark is utilized in the current work, as it is a key solution that has been widely employed for spatial data. Specifically, it works on top of Apache Spark, the main framework leveraged by the research community and organizations for big data transformation, processing, and visualization. To this end, we focus on a trajectory data representation applicable to the GeoSpark environment, and a GeoSpark-based approach is designed for the efficient management of real spatio-temporal data. The next step is to gain a deeper understanding of the data through the application of k nearest neighbor (k-NN) queries, both with and without indexing methods. The k-anonymity set computation, which is the main component for privacy-preservation evaluation and the main issue of our previous works, is evaluated in the GeoSpark environment. More to the point, the focus here is on the time cost of k-anonymity set computation along with vulnerability measurement. The extracted results are presented in tables and figures for visual inspection.
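
To make the querying step concrete, the following is a minimal sketch of a k-NN query over a point dataset in GeoSpark, executed both without and with a per-partition R-tree index. The input path, column offset, and query coordinates are hypothetical, and the JTS package name depends on the GeoSpark release (older versions use com.vividsolutions.jts, newer ones org.locationtech.jts); this is an illustration of the query style under those assumptions, not the authors' exact pipeline.

  import com.vividsolutions.jts.geom.{Coordinate, GeometryFactory}
  import org.apache.spark.SparkConf
  import org.apache.spark.api.java.JavaSparkContext
  import org.datasyslab.geospark.enums.{FileDataSplitter, IndexType}
  import org.datasyslab.geospark.spatialOperator.KNNQuery
  import org.datasyslab.geospark.spatialRDD.PointRDD

  object KnnSketch {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("geospark-knn-sketch").setMaster("local[*]")
      val sc = new JavaSparkContext(conf)

      // Trajectory points stored as CSV, longitude/latitude in the first two columns (assumed layout).
      val points = new PointRDD(sc, "/data/trajectory_points.csv", 0, FileDataSplitter.CSV, true)

      // The location whose k nearest neighbors we want (illustrative coordinates).
      val queryPoint = new GeometryFactory().createPoint(new Coordinate(23.7275, 37.9838))
      val k: java.lang.Integer = 10

      // k-NN without an index: scans the raw partitions.
      val knnNoIndex = KNNQuery.SpatialKnnQuery(points, queryPoint, k, false)

      // k-NN with an R-tree index built on the raw RDD.
      points.buildIndex(IndexType.RTREE, false)
      val knnWithIndex = KNNQuery.SpatialKnnQuery(points, queryPoint, k, true)

      println(s"k-NN without index: $knnNoIndex")
      println(s"k-NN with R-tree:   $knnWithIndex")
      sc.stop()
    }
  }

Running the indexed and non-indexed variants over the same dataset is the typical way to compare their time cost, which is the kind of measurement reported in this work.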

Highlights

  • There is no doubt that we live in the era of Big Data

  • The goal of this paper is to provide an efficient and scalable framework for robust continuous k nearest neighbor (k-NN) querying of spatial objects in GeoSpark

  • We could consider computing AkNN queries based on kdANN or kdANN+, instead of simple k-NN queries, and study the effect of d on the performance of such queries and on k-anonymity set computation, which is the main issue in the context of this research work (a minimal sketch of the anonymity-set computation follows this list)
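
To connect the k-NN result to privacy, the sketch below shows one generic way an anonymity set and a vulnerability value can be derived from a neighbor list: the k nearest neighbors are treated as the set of indistinguishable objects, and vulnerability is taken as the re-identification probability under uniform uncertainty (one divided by the set size). The helper names are hypothetical, and this formulation is a common one from the location-privacy literature, not necessarily the exact definition used in this paper.

  import com.vividsolutions.jts.geom.Point
  import scala.collection.JavaConverters._

  // Hypothetical helpers: the k-NN result (a java.util.List[Point]) is interpreted as
  // the anonymity set of the query object, and vulnerability is the chance of picking
  // the true object out of that set under uniform uncertainty.
  def anonymitySet(neighbors: java.util.List[Point]): Set[Point] =
    neighbors.asScala.toSet

  def vulnerability(anonSet: Set[Point]): Double =
    if (anonSet.isEmpty) 1.0 else 1.0 / anonSet.size

  // Example usage with the indexed k-NN result from the previous sketch:
  //   val anonSet = anonymitySet(knnWithIndex)
  //   println(s"|anonymity set| = ${anonSet.size}, vulnerability = ${vulnerability(anonSet)}")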

Introduction

There is no doubt that we live in the era of Big Data. Over the last decade, thanks to technological advances, information systems have favored automatic and effective data gathering, resulting in a considerable increase in the amount of available data. A wide range of data is produced: scientific, financial, and health data, as well as data from social media, are just some examples of sources. This data is useless without the extraction of the underlying knowledge, a major challenge for researchers, as classical machine learning methods cannot cope with the volume, value, veracity, and variety that big data brings [1]. Existing machine learning techniques, which must deal with these four Vs [2], have been, or need to be, redefined to efficiently process and manage such data, as well as to extract valuable information that can benefit scientists, businesses, and organizations. Most existing methods fail to directly tackle the increased number of attributes and records.
