Abstract
Data anonymization strategies such as subtree generalization have been hailed as techniques that provide a more efficient generalization strategy compared to full-tree generalization counterparts. Many subtree-based generalizations strategies (e.g., top-down, bottom-up, and hybrid) have been implemented on the MapReduce platform to take advantage of scalability and parallelism. However, MapReduce inherent lack support for iteration intensive algorithm implementation such as subtree generalization. This paper proposes Distributed Dataset (RDD)-based implementation for a subtree-based data anonymization technique for Apache Spark to address the issues associated with MapReduce-based counterparts. We describe our RDDs-based approach that offers effective partition management, improved memory usage that uses cache for frequently referenced intermediate values, and enhanced iteration support. Our experimental results provide high performance compared to the existing state-of-the-art privacy preserving approaches and ensure data utility and privacy levels required for any competitive data anonymization techniques.
Highlights
Privacy preservation is an ongoing and challenging issue that impacts people’s lives on a daily basis
We identified that this effect is because of two reasons: (i) The records are required for the measurement of privacy and utility score from Resilient Distributed Dataset (RDD) rather than the complete data records; after each generalization step, the same records are aggregated and represented with the key-value pairs
This study proposes a generic framework for implementing subtree-based generations on Apache Spark
Summary
Privacy preservation is an ongoing and challenging issue that impacts people’s lives on a daily basis. Anonylitics [23] used Spark’s default iteration support to implement data anonymization and PRIMA [24] proposes a Spark anonymization strategy to define the utility and generalization level rules for limiting data loss. These existing proposals offer interesting aspects of the k-anonymity-based anonymization strategy, they neither provide any guidelines and strategies as to how different types of subtree data anonymization approaches can be best implemented using Spark as a generic framework nor provide any implications of privacy and utility measure.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.