Abstract
Privacy is one of the biggest concerns in big data and analytics. We believe that forthcoming frameworks and theories will establish several solutions for privacy protection. One known solution is k-anonymity, which was introduced for traditional data. Two major frameworks, MapReduce and Spark, have recently advanced big data processing and applications. Spark has been attracting more attention because of its impact on a wide range of big data applications, among which data analytics and anonymization are predominant. We previously proposed an anonymization method that implements k-anonymity in the MapReduce processing framework. In this paper, we investigate Spark's performance in processing data anonymization. Spark is a fast processing framework that has been applied in several areas, such as SQL, multimedia, and data streams. Our focus is Spark SQL, which is adequate for big data anonymization. Since Spark operates in memory, we need to observe its limitations, speed, and fault tolerance as data size increases, and to compare MapReduce with Spark in processing anonymity. Spark introduces an abstraction called resilient distributed datasets (RDDs), which read and serialize a collection of objects partitioned across a set of machines. Developers claim that Spark can outperform MapReduce by 10 times in iterative machine learning jobs. Our experiments in this paper compare MapReduce and Spark. The overall results show better processing time for Spark in anonymity operations. However, in some limited cases, when the cluster resources are limited and the network is non-congested, we prefer the older MapReduce framework.
Highlights
The evolution of big data has produced new software tools and techniques that accelerate processing and increase scalability.
We introduce a novel anonymization method, Multi-Dimensional Sensitivity-Based Anonymization (MDSBA), which uses Bottom-Up Generalization (BUG) in k-anonymity and can cope with big data frameworks.
We experimented with two popular frameworks for our k-anonymity method: both Pig and Spark were examined in the MDSBA anonymization process for big data.
Summary
The evolution of big data has produced new software tools and techniques that accelerate processing and increase scalability. Big data is more beneficial when it is shared among multiple entities. This means that many organizations from different fields need to access the data for multiple purposes [1]. The quasi-identifier (Q-ID) is a group of attributes that, taken together, can identify tuples in the database. These identifiers may not reveal 100% of the data, but the risk of predicting some data remains high. The original k-anonymity method defines a minimum generalization and a maximum generalization. A table is called k-anonymous if, for every tuple, at least k − 1 other records have the same Q-ID values. This means that every equivalence group on the Q-ID has a size of at least k [7].
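The k-anonymity condition on Q-ID equivalence groups can be illustrated with a minimal sketch in plain Python. This is not the MDSBA method or its Pig or Spark implementation; the function names, the toy records, and the choice of `age` and `zip` as the Q-ID are assumptions made only for illustration.

```python
from collections import Counter


def equivalence_class_sizes(rows, qid):
    """Count how many rows share each combination of Q-ID values."""
    return Counter(tuple(row[attr] for attr in qid) for row in rows)


def is_k_anonymous(rows, qid, k):
    """Return True if every Q-ID combination occurs in at least k rows."""
    sizes = equivalence_class_sizes(rows, qid)
    return all(size >= k for size in sizes.values())


# Hypothetical toy records, already generalized (age ranges, masked zip codes).
# "age" and "zip" are assumed to form the Q-ID; "disease" is the sensitive value.
rows = [
    {"age": "30-39", "zip": "130**", "disease": "flu"},
    {"age": "30-39", "zip": "130**", "disease": "asthma"},
    {"age": "40-49", "zip": "148**", "disease": "flu"},
    {"age": "40-49", "zip": "148**", "disease": "diabetes"},
]
qid = ["age", "zip"]

two_anonymous = is_k_anonymous(rows, qid, 2)    # each group holds 2 rows
three_anonymous = is_k_anonymous(rows, qid, 3)  # no group holds 3 rows
```

Here each Q-ID combination appears in exactly two rows, so the table satisfies the condition for k = 2 but not for k = 3. On a cluster, the same grouping and counting would typically be expressed as a group-by aggregation rather than an in-memory counter.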