Experimenting sensitivity-based anonymization framework in Apache Spark

Mohammed Al-Zobbi, Chun Ruan, Seyed Shahrestani

doi:10.1186/s40537-018-0149-0

Open Access

Journal: Journal of Big Data
Publication Date: Oct 13, 2018
Citations: 7
License type: open-access

Affiliation: Western Sydney University

Abstract

Privacy is one of the biggest concerns in big data and analytics. We believe that forthcoming frameworks and theories will establish several solutions for privacy protection. One known solution is k-anonymity, which was introduced for traditional data. Recently, two major frameworks, MapReduce and Spark, have advanced big data processing and applications. Spark has been attracting more attention because of its impact on a wide range of big data applications, among which data analytics and anonymization are predominant. We previously proposed an anonymization method for implementing k-anonymity in the MapReduce processing framework. In this paper, we investigate Spark’s performance in processing data anonymization. Spark is a fast processing framework that has been applied in several areas, such as SQL, multimedia, and data streams. Our focus is Spark SQL, which is adequate for big data anonymization. Since Spark operates in memory, we need to observe its limitations, speed, and fault tolerance as data size increases, and to compare MapReduce with Spark in processing anonymity. Spark introduces an abstraction called resilient distributed datasets (RDDs), which reads and serializes a collection of objects partitioned across a set of machines. Its developers claim that Spark can outperform MapReduce by 10 times in iterative machine learning jobs. Our experiments in this paper compare MapReduce and Spark. The overall results show better processing time for Spark in anonymity operations. However, in some limited cases, when cluster resources are limited and the network is non-congested, we prefer the older MapReduce framework.

Highlights

The evolution of big data has produced new software tools and techniques that accelerate processing and increase scalability.
Multi-dimensional sensitivity-based anonymization: we introduce a novel anonymization method that uses Bottom-Up Generalization (BUG) in k-anonymity and can cope with big data frameworks.
We experimented with two of these popular frameworks for our k-anonymity method: both Pig and Spark were examined in the Multi-Dimensional Sensitivity-Based Anonymization (MDSBA) process for big data.

Summary

Introduction

The evolution of big data has produced new software tools and techniques that accelerate processing and increase scalability. Big data is more beneficial when it is shared among multiple entities. This means that many organizations from different fields need to access the data for multiple purposes [1]. The quasi-identifier (Q-ID) is a group of attributes that, taken together, can identify tuples in the database. These identifiers may not reveal 100% of the data, but the risk of predicting some of it remains high. The original k-anonymity method defines a minimum generalization and a maximum generalization. A table is called k-anonymous if, for every tuple, at least k − 1 other records share the same Q-ID values. This means that every equivalence group on the Q-ID has a size of at least k [7].
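The k-anonymity condition described above can be illustrated with a minimal Python sketch. It is not the authors' MDSBA implementation and does not use Spark or Pig. The sample records, the attribute names (`age`, `zip`, `diagnosis`), the choice of Q-ID, and the `generalize_age` helper are all hypothetical and were chosen only to show how generalizing a Q-ID attribute can enlarge equivalence groups until each holds at least k records.

```python
from collections import Counter

def equivalence_group_sizes(records, qid):
    """Count how many records share each combination of Q-ID values."""
    return Counter(tuple(r[a] for a in qid) for r in records)

def is_k_anonymous(records, qid, k):
    """True if every Q-ID combination appears in at least k records."""
    sizes = equivalence_group_sizes(records, qid)
    return all(size >= k for size in sizes.values())

def generalize_age(record, width=10):
    """Hypothetical generalization step: replace an exact age with a range label."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}

# Hypothetical sample data; "age" and "zip" are assumed to form the Q-ID.
records = [
    {"age": 23, "zip": "2150", "diagnosis": "flu"},
    {"age": 27, "zip": "2150", "diagnosis": "asthma"},
    {"age": 31, "zip": "2151", "diagnosis": "flu"},
    {"age": 38, "zip": "2151", "diagnosis": "diabetes"},
]
QID = ["age", "zip"]

# Raw data: every Q-ID combination is unique, so the table is not 2-anonymous.
raw_ok = is_k_anonymous(records, QID, k=2)

# After generalizing age into 10-year ranges, each group holds two records.
generalized = [generalize_age(r) for r in records]
generalized_ok = is_k_anonymous(generalized, QID, k=2)
```

In this sketch the raw table fails the check for k = 2 because each record has a unique (age, zip) pair, while the generalized table passes because each (age range, zip) group contains two records.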
