Background

The expansion of the Internet and its use for online activities such as e-commerce and social networking produce large volumes of transactional data. Analysis of this data reveals global trends and interesting patterns that serve a range of decision-making purposes, but the same analytics can expose sensitive information present in these datasets, which poses a serious privacy threat. Sequential sanitization heuristics have been used to address this threat in the past, when data volumes were small enough to accommodate them; at current volumes they often incur prohibitively high execution times. This new challenge of scalability motivates experimenting with Big Data approaches such as the MapReduce framework. We have combined the MapReduce framework with the adopted heuristics to overcome the scalability challenge while providing the much-needed privacy preservation, yielding efficient analytic results within bounded execution times.

Methods

MapReduce is a parallel programming framework [16] that provides the opportunity to leverage largely distributed resources for Big Data analytics: it allows the resources of a largely distributed system to be utilized in parallel, and its simplicity and high fault tolerance make it a promising framework. We therefore propose a two-phase MapReduce version of the adopted heuristics. The MapReduce framework divides the whole dataset into n data chunks, D = d₁ ∪ d₂ ∪ d₃ ∪ … ∪ dₙ, and distributes them over n computing nodes to achieve parallelization. The first phase of the MapReduce job runs on each data chunk to generate intermediate results, which are then sorted and merged in the second phase to produce the final sanitized dataset (see the sketch after the Conclusion).

Results

We conducted three sets of experiments, each with five scenarios corresponding to different cluster sizes, n = 1, 2, 3, 4, 5, where n is the number of computing nodes, and compared the approaches on both real and synthetically generated large datasets. For varying data sizes and varying numbers of computing nodes, the sanitization time required by the MapReduce-based algorithm on a dataset of a given size is much lower than that of the traditional sequential approach, and scalability improves further as more computing nodes are used. A final set of experiments explores how sanitization time changes with the amount of sensitive content present in a dataset; across cluster sizes from 1 to 5 nodes, the execution time of our approach remains much lower than that of traditional schemes. Furthermore, no hiding failures or artifactual patterns were observed during the experiments, and in terms of misses cost the MapReduce version performs the same as the traditional approaches.

Conclusion

Traditional data-hiding approaches, primarily MaxFIA and SWA, are unable to handle large, voluminous data. To overcome this new challenge of scalability, we have implemented these basic heuristics with a Big Data approach, the MapReduce framework.
Quantitative evaluations have shown that fusing the MapReduce framework with these adopted heuristics meets the scalability requirement and yields efficient analytic results many-fold faster.
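The two-phase structure described in Methods can be illustrated with a minimal sketch. The pure-Python simulation below is an assumption-laden illustration, not the paper's implementation: the chunking scheme, the victim-item heuristic inside map_sanitize (a crude stand-in for MaxFIA/SWA), and the names SENSITIVE, map_sanitize, and reduce_merge are all hypothetical, and a real deployment would run the two phases as distributed MapReduce jobs rather than local function calls.

```python
# Minimal sketch of the two-phase sanitization flow. Hypothetical names and
# heuristic; not the paper's MaxFIA/SWA implementation.
from itertools import chain

# Assumed sensitive itemsets whose support must be hidden.
SENSITIVE = [frozenset({"a", "b"})]

def split_into_chunks(dataset, n):
    """Partition the dataset into n chunks: D = d1 U d2 U ... U dn."""
    return [dataset[i::n] for i in range(n)]

def map_sanitize(chunk):
    """Phase 1 (map): locally sanitize one chunk, emitting intermediate results."""
    out = []
    for tid, items in chunk:
        items = set(items)
        for s in SENSITIVE:
            if s <= items:            # transaction supports a sensitive itemset
                items.discard(min(s))  # drop one victim item to break support
        out.append((tid, sorted(items)))
    return out

def reduce_merge(intermediate):
    """Phase 2 (reduce): sort and merge per-chunk results into the final dataset."""
    return sorted(chain.from_iterable(intermediate), key=lambda t: t[0])

if __name__ == "__main__":
    dataset = [(1, ["a", "b", "c"]), (2, ["b", "c"]),
               (3, ["a", "b"]), (4, ["a", "c"])]
    chunks = split_into_chunks(dataset, n=2)
    print(reduce_merge(map_sanitize(c) for c in chunks))
```

In an actual cluster, each map_sanitize call would run on the node holding its chunk, and the sort-and-merge step would be handled by the framework's shuffle and reduce stages.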