Abstract

Data anonymization strategies such as subtree generalization are regarded as more efficient than their full-tree generalization counterparts. Many subtree-based generalization strategies (e.g., top-down, bottom-up, and hybrid) have been implemented on the MapReduce platform to take advantage of its scalability and parallelism. However, MapReduce inherently lacks support for iteration-intensive algorithms such as subtree generalization. This paper proposes a Resilient Distributed Dataset (RDD)-based implementation of a subtree-based data anonymization technique for Apache Spark to address the issues associated with MapReduce-based counterparts. We describe our RDD-based approach, which offers effective partition management, improved memory usage through caching of frequently referenced intermediate values, and enhanced iteration support. Our experimental results show that the approach achieves high performance compared with existing state-of-the-art privacy-preserving approaches while ensuring the data utility and privacy levels required of any competitive data anonymization technique.

Highlights

  • Privacy preservation is an ongoing and challenging issue that impacts people’s lives on a daily basis

  • We identified that this effect has two causes: (i) the records required for measuring the privacy and utility scores are taken from the Resilient Distributed Dataset (RDD) rather than from the complete data records; (ii) after each generalization step, identical records are aggregated and represented as key-value pairs

  • This study proposes a generic framework for implementing subtree-based generalizations on Apache Spark
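The aggregation idea mentioned in the highlights can be illustrated with a minimal sketch. It is plain Python rather than Spark code, and it is not the authors' implementation: the taxonomy, the sample records, and the helper names `generalize`, `aggregate`, and `is_k_anonymous` are all hypothetical. It shows one generalization step that replaces child values with their parent, then collapses identical records into (record, count) pairs so that a k-anonymity check only needs the counts. On Spark, a comparable aggregation could be expressed with an RDD of pairs and `reduceByKey`.

```python
from collections import Counter

# Hypothetical one-level taxonomy: child value -> parent value.
# All children of a parent are generalized together, as in subtree generalization.
TAXONOMY = {
    "Bachelors": "Higher-Ed",
    "Masters": "Higher-Ed",
    "HS-grad": "Secondary",
    "11th": "Secondary",
}


def generalize(record, taxonomy):
    """Replace each attribute value by its parent in the taxonomy, if it has one."""
    return tuple(taxonomy.get(value, value) for value in record)


def aggregate(records):
    """Collapse identical records into (record, count) key-value pairs."""
    return Counter(records)


def is_k_anonymous(counts, k):
    """Every distinct quasi-identifier combination must cover at least k records."""
    return all(count >= k for count in counts.values())


# Hypothetical sample records: (education, sex).
records = [
    ("Bachelors", "Male"),
    ("Masters", "Male"),
    ("HS-grad", "Female"),
    ("11th", "Female"),
]

before = aggregate(records)
after = aggregate(generalize(r, TAXONOMY) for r in records)

print(is_k_anonymous(before, 2))
print(is_k_anonymous(after, 2))
```

In this sketch the aggregated form shrinks as generalization proceeds, because more records become identical. That is one plausible reading of why working on key-value pairs is cheaper than rescanning the complete data records.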

Summary

Introduction

Privacy preservation is an ongoing and challenging issue that impacts people’s lives on a daily basis. Anonylitics [23] uses Spark’s default iteration support to implement data anonymization, and PRIMA [24] proposes a Spark anonymization strategy that defines utility and generalization-level rules to limit data loss. Although these existing proposals offer interesting perspectives on k-anonymity-based anonymization, they neither provide guidelines or strategies for how different types of subtree data anonymization approaches can best be implemented using Spark as a generic framework, nor discuss the implications for privacy and utility measures.

Related Work
Subtree Generalization
Preliminaries
Subtree Generalization Algorithm
Review of Subtree Implementation in MapReduce
Partition
Memory
Iteration
Our Proposal
Phase 1—Initialization
Phase 2—Generalization
Phase 3—Validation
Experimental Results
Datasets
System Environment Configurations
Performance and Scalability
Performance Comparison with Existing Subtree Approaches
Performance Comparison with Existing Spark-Based k-Anonymity Approaches
Performance Comparison on Adult and Irish Datasets
Memory Effects on Performance and Scalability
Iteration Effects on Scalability
Conclusions and Future Work
