In-Memory Data Anonymization Using Scalable and High Performance RDD Design

Sibghat Ullah Bazai, Julian Jang-Jaccard

doi:10.3390/electronics9101732

Open Access

https://doi.org/10.3390/electronics9101732

Journal: Electronics	Publication Date: Oct 20, 2020
Citations: 9	License type: CC BY 4.0

Affiliation: Massey University

Abstract

Recent studies in data anonymization techniques have primarily focused on MapReduce. However, these MapReduce-based approaches often suffer from performance overheads due to inappropriate data allocation, expensive disk I/O access and network transfer, and a lack of support for iterative tasks. We propose “SparkDA”, a novel anonymization technique designed to take full advantage of the Spark platform to generate privacy-preserving anonymized datasets efficiently. Our proposal offers better partition control, in-memory operation, and cache management for the iterative operations that are heavily utilised in data anonymization processing. It is based on Spark’s Resilient Distributed Dataset (RDD) and two critical RDD operations, FlatMapRDD and ReduceByKeyRDD. The experimental results demonstrate that our proposal outperforms the existing approaches in terms of performance and scalability while maintaining high data privacy and utility levels. This illustrates that our proposal can be used in a wider range of big data applications that demand privacy.
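The abstract names two RDD operations, FlatMapRDD and ReduceByKeyRDD. As a rough illustration of that general pattern only, the sketch below emulates flatMap-style and reduceByKey-style steps in plain Python, without Spark. The helper names `flat_map`, `reduce_by_key`, and `generalize`, the toy records, and the generalization rule (age to decade, zipcode to a three-digit prefix) are all assumptions made for this example and are not taken from the paper.

```python
def flat_map(records, fn):
    """Apply fn to every record and flatten the lists it returns."""
    return [out for rec in records for out in fn(rec)]


def reduce_by_key(pairs, fn):
    """Merge the values of pairs that share a key using fn."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


# Hypothetical toy records: (age, zipcode) act as quasi-identifiers.
records = [(34, "4410"), (36, "4412"), (34, "4410"), (52, "4474")]


def generalize(record):
    """Emit one (generalized quasi-identifier tuple, 1) pair per record (assumed rule)."""
    age, zipcode = record
    low = age // 10 * 10
    return [((f"{low}-{low + 9}", zipcode[:3] + "*"), 1)]


# flatMap-style step: records -> (generalized key, 1) pairs.
pairs = flat_map(records, generalize)
# reduceByKey-style step: sum the 1s for each generalized key.
counts = reduce_by_key(pairs, lambda a, b: a + b)
```

In this sketch, `counts` maps each generalized quasi-identifier tuple to the number of records that fall into it. On a real Spark cluster the same two steps would run in parallel across partitions rather than over an in-memory list.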

Highlights

The rapid growth of data from many domains has brought in a new era where extracting potential information using data analytics and data mining has become a top business priority for many organizations
In addition to further utility metrics, namely the Discernibility Metric (DM) and Minimal Distortion (MD), we provide a new set of privacy metrics, such as Kullback-Leibler Divergence (KLD) and Information Entropy (IE), to extensively investigate the privacy and utility trade-offs of our proposal
Data anonymization starts by counting the frequency, which represents the number of appearances in the given record set, over the Quasi-Identifier (QID) attributes
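The highlights above mention frequency counting over the QID attributes and metrics such as DM, KLD, and IE. The sketch below shows how such quantities could be computed from frequency counts in plain Python. It uses common textbook definitions (DM as the sum of squared equivalence-class sizes, Shannon entropy, and KL divergence in bits), which are assumptions here and may differ from the exact formulations in the paper. The function names and sample rows are hypothetical.

```python
import math
from collections import Counter


def discernibility(class_sizes):
    """Textbook DM: each record is penalized by the size of its equivalence class."""
    return sum(size * size for size in class_sizes)


def entropy(counts):
    """Shannon entropy (in bits) of the distribution implied by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical generalized QID tuples, one per record.
qid_rows = [
    ("30-39", "441*"),
    ("30-39", "441*"),
    ("30-39", "441*"),
    ("50-59", "447*"),
]

# Frequency of each QID combination, i.e. the equivalence-class sizes.
freq = Counter(qid_rows)
sizes = list(freq.values())

dm = discernibility(sizes)
ie = entropy(sizes)
```

Under these assumed definitions, larger equivalence classes raise DM (lower utility), while the entropy and divergence values summarize how the anonymized distribution spreads over classes or departs from a reference distribution.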

Summary

Introduction

The rapid growth of data from many domains (e.g., social media, smartphones, IoT etc.) has brought in a new era where extracting potential information using data analytics and data mining has become a top business priority for many organizations. With the surge in the popularity of Spark and the shift away from the MapReduce approach, many Spark-based data anonymization techniques have been proposed [15,16,17,18,19]. These existing proposals tend to focus their efforts on improving and readdressing scalability to be better suited to Spark, instead of investigating the suitability of Spark as a platform of choice for data anonymization techniques. This paper is an extension of the earlier version, which we presented in Reference [20].

Related Work

Background

Data Anonymization

Basic Symbols and Notations

RDD-Based Data Anonymization

Overall SparkDA Scheme

Privacy Metrics

Utility Metrics

Experimental Results

Datasets

System Environment Configurations

Privacy Results

Utility Results

Scalability

Performance

Caching

Conclusions and Future Work


