Abstract

Introduction

Many data analytics algorithms are originally designed for in-memory data. Parallel and distributed computing is a natural first remedy to scale these algorithms to "Big algorithms" for large-scale data. Many advances in Big Data analytics algorithms have been driven by MapReduce, a programming paradigm that enables parallel and distributed execution of massive data processing on large clusters of machines. Much research has focused on building efficient naive MapReduce-based algorithms or on extending MapReduce mechanisms to enhance performance. However, we argue that these should not be the only research directions to pursue. We conjecture that when naive MapReduce-based solutions do not perform well, it could be because certain classes of algorithms are not amenable to the MapReduce model, and a fundamentally different approach is needed to arrive at a new MapReduce-based solution.

Case description

This paper investigates a case study of the scaling problem of "Big algorithms" for a popular association rule-mining algorithm, particularly the development of the Apriori algorithm in the MapReduce model.

Discussion and evaluation

Formal and empirical illustrations are explored to compare our proposed MapReduce-based Apriori algorithm with previous solutions. The findings support our conjecture, and our study shows promising results compared to the state-of-the-art performer, with a 7% average performance improvement over datasets ranging from 10,000 to 120,000 transactions.

Conclusions

The results confirm that an effective MapReduce implementation should avoid dependent iterations, such as those of the original sequential Apriori algorithm. These findings could lead to many more alternative non-naive MapReduce-based "Big algorithms".

Highlights

  • Many data analytics algorithms are originally designed for in-memory data

  • The results confirm that effective MapReduce implementation should avoid dependent iterations, such as that of the original sequential Apriori algorithm

  • This paper presents a study of the applicability of MapReduce for scaling data analytics and machine learning algorithms to "Big algorithms" for Big Data


Summary

Discussion and evaluation

The proposed non-naive AprioriS algorithm has several advantages. First, its concept is simple, making it easy to understand and implement. Most importantly, based on both theoretical and empirical results, AprioriS is highly effective in performance while producing the same accuracy: it requires one scan of the database and a single phase of MapReduce. Not all algorithms are amenable to the MapReduce model, and the transitions of such algorithms to the MapReduce paradigm have proven to be much more complex or ineffective [16]. Examples include iterative algorithms, some of which require a chain of data to be processed for convergence or to be updated after each iteration [25, 33]. This clearly adds overhead in communication and data movement. To parallelize these algorithms, one should not simply follow the naive MapReduce-based implementation that mimics the original sequential process, but instead look for alternative solutions that effectively exploit parallelism, as sketched below.
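To make the contrast concrete, the following is a minimal Python sketch of a single-scan, single-phase MapReduce-style frequent-itemset count in the spirit described above: each mapper emits every itemset contained in a transaction, and a single reduce step sums counts and filters by minimum support, with no chained iterations between jobs. This is an illustrative sketch only; the function names, the max_len cap, and the in-memory simulation of the map and reduce phases are our assumptions, not the paper's actual implementation.

```python
from itertools import combinations
from collections import Counter

def map_transaction(transaction, max_len=3):
    """Map phase (sketch): emit (itemset, 1) for every itemset,
    up to max_len items, contained in one transaction.
    One pass over the data replaces the k dependent passes of
    the classic iterative Apriori."""
    items = sorted(set(transaction))
    for k in range(1, min(max_len, len(items)) + 1):
        for itemset in combinations(items, k):
            yield itemset, 1

def reduce_counts(pairs, min_support):
    """Reduce phase (sketch): sum counts per itemset and keep
    only itemsets meeting the minimum support threshold."""
    counts = Counter()
    for itemset, count in pairs:
        counts[itemset] += count
    return {s: c for s, c in counts.items() if c >= min_support}

if __name__ == "__main__":
    # Hypothetical toy dataset to show the flow end to end.
    transactions = [
        ["bread", "milk"],
        ["bread", "butter", "milk"],
        ["butter", "milk"],
    ]
    pairs = (p for t in transactions for p in map_transaction(t))
    print(reduce_counts(pairs, min_support=2))
    # e.g. {('bread',): 2, ('milk',): 3, ('bread', 'milk'): 2, ...}
```

By contrast, a naive MapReduce port of sequential Apriori would launch one MapReduce job per candidate length k, feeding each job the frequent itemsets of the previous one; it is exactly that chain of dependent iterations, and the communication and data movement between them, that the single-phase design avoids.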

