Scalable, High-Performance, and Generalized Subtree Data Anonymization Approach for Apache Spark

Sibghat Ullah Bazai, Hooman Alavizadeh, Julian Jang-Jaccard

doi:10.3390/electronics10050589

Open Access

https://doi.org/10.3390/electronics10050589

Journal: Electronics
Publication Date: Mar 3, 2021
Citations: 14
License type: CC BY 4.0

Affiliation: Massey University

Abstract

Data anonymization strategies such as subtree generalization are regarded as more efficient than their full-tree generalization counterparts. Many subtree-based generalization strategies (e.g., top-down, bottom-up, and hybrid) have been implemented on the MapReduce platform to take advantage of its scalability and parallelism. However, MapReduce inherently lacks support for iteration-intensive algorithms such as subtree generalization. This paper proposes a Resilient Distributed Dataset (RDD)-based implementation of a subtree-based data anonymization technique for Apache Spark to address the issues associated with its MapReduce-based counterparts. We describe our RDD-based approach, which offers effective partition management, improved memory usage through caching of frequently referenced intermediate values, and enhanced iteration support. Our experimental results show high performance compared with existing state-of-the-art privacy-preserving approaches while maintaining the data utility and privacy levels required of any competitive data anonymization technique.
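To make the distinction in the abstract concrete, the following is a minimal plain-Python sketch of a single subtree generalization step. It is not the paper's implementation and does not use Spark. The occupation taxonomy, the `PARENT` mapping, and the function name `generalize_subtree` are hypothetical and chosen only for illustration. The point it shows is that generalizing one value to a parent node forces all sibling values under that parent to be generalized as well, while values in other subtrees stay as they are (a full-tree scheme would instead lift every value to the same taxonomy level).

```python
# Hypothetical taxonomy for an "occupation" attribute, stored as child -> parent.
PARENT = {
    "Engineer": "Professional",
    "Lawyer": "Professional",
    "Dancer": "Artist",
    "Writer": "Artist",
    "Professional": "Any",
    "Artist": "Any",
}


def generalize_subtree(values, node):
    """Replace every value that lies below `node` with `node`.

    Values outside the subtree rooted at `node` are left unchanged,
    which is what distinguishes subtree from full-tree generalization.
    """
    def lies_under(value):
        # Walk up the taxonomy until we reach `node` or run out of parents.
        while value in PARENT:
            value = PARENT[value]
            if value == node:
                return True
        return False

    return [node if lies_under(v) else v for v in values]


column = ["Engineer", "Lawyer", "Dancer", "Engineer", "Writer"]
result = generalize_subtree(column, "Professional")
print(result)
```

Here both "Engineer" and "Lawyer" are replaced by "Professional", while "Dancer" and "Writer" keep their original values because they belong to a different subtree.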

Highlights

Privacy preservation is an ongoing and challenging issue that impacts people’s lives on a daily basis
We attribute this effect to two reasons: (i) measuring the privacy and utility scores requires only the records held in the Resilient Distributed Dataset (RDD) rather than the complete data records; after each generalization step, identical records are aggregated and represented as key-value pairs
This study proposes a generic framework for implementing subtree-based generalizations on Apache Spark
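The key-value aggregation mentioned in the highlights can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the authors' code: the sample records, the helper names `aggregate` and `is_k_anonymous`, and the choice of k are all hypothetical. In Spark, a comparable aggregation would typically be expressed by mapping each record to a `(key, 1)` pair and combining counts with `reduceByKey`; a `Counter` stands in for that step here so the example runs without a cluster.

```python
from collections import Counter


def aggregate(records):
    """Collapse identical generalized records into (record, count) pairs."""
    return Counter(records)


def is_k_anonymous(pairs, k):
    """A table is k-anonymous if every group of identical records has >= k members."""
    return all(count >= k for count in pairs.values())


# Hypothetical records after one generalization step: (occupation, age range).
records = [
    ("Professional", "30-39"),
    ("Professional", "30-39"),
    ("Artist", "20-29"),
    ("Artist", "20-29"),
    ("Professional", "30-39"),
]

pairs = aggregate(records)
print(dict(pairs))
print(is_k_anonymous(pairs, 2), is_k_anonymous(pairs, 3))
```

Because the privacy check reads only the aggregated counts, it touches two key-value pairs in this example rather than all five records, which is the kind of saving the highlight describes.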

Summary

Introduction

Privacy preservation is an ongoing and challenging issue that affects people’s lives on a daily basis. Anonylitics [23] uses Spark’s default iteration support to implement data anonymization, and PRIMA [24] proposes a Spark anonymization strategy that defines utility and generalization-level rules to limit data loss. Although these proposals offer interesting aspects of k-anonymity-based anonymization, they neither provide guidelines or strategies for how different types of subtree data anonymization approaches can best be implemented using Spark as a generic framework, nor examine the implications for privacy and utility measures.

Related Work

Subtree Generalization

Preliminaries

Subtree Generalization Algorithm

Review of Subtree Implementation in MapReduce

Partition

Memory

Iteration

Our Proposal

Phase 1—Initialization

Phase 2—Generalization

Phase 3—Validation

Experimental Results

Datasets

System Environment Configurations

Performance and Scalability

Performance Comparison with Existing Subtree Approaches

Performance Comparison with Existing Spark-Based k-Anonymity Approaches

Performance Comparison on Adult and Irish Datasets

Memory Effects on Performance and Scalability

Iteration Effects on Scalability

Conclusions and Future Work

