Abstract

Data analytics has become an integral part of large-scale scientific computing. Among various data analytics frameworks, MapReduce has gained the most traction. Although some efforts have been made to enable efficient MapReduce for supercomputing systems, they are often limited to fairly homogeneous workloads where equal partitioning of input data across tasks results in essentially equal output or temporary data generated on each task. For workloads that are more skewed, however, current implementations can suffer imbalance in memory usage and, consequently, slower execution and a loss of data scalability. To tackle this problem, we enhance a previously published memory-conscious MapReduce over MPI framework called Mimir. Our enhancements to Mimir include combiner and dynamic repartition optimizations that minimize and balance memory usage across processes, achieving close-to-optimal memory balance and reducing execution time by up to 12 times. Experimental results show that Mimir can scale to at least 3072 processes on the Tianhe-2 supercomputer on skewed datasets.

Highlights

  • With the growth of simulation and scientific data, data analytics and data-intensive workloads have become an integral part of large-scale scientific computing

  • We evaluate Mimir’s in-memory workflow, the combiner workflow, the dynamic repartition workflow, and the superkey and splitting approach with respect to memory usage and performance on the Tianhe-2 supercomputer for three benchmarks and three types of datasets: balanced data, value-imbalanced data, and key-mapping-imbalanced data

  • We present two data-driven optimizations: a dynamic repartition approach to mitigate the impact of the data skew problem and a splitting strategy to deal with datasets in which a few keys occur significantly more frequently than do the other keys
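The splitting strategy in the last highlight can be illustrated with a minimal sketch. This is not Mimir's actual C++/MPI interface; the sub-key naming scheme, `NUM_SPLITS`, and the helper names are assumptions made for this example. The idea is that a "hot" key whose values dominate one process is split into several sub-keys, so the partitioner scatters its values across multiple reduce processes, and a final pass merges the partial results:

```python
# Illustrative sketch of key splitting for skewed keys (not Mimir's API).
# A frequent key gets a split index appended so its values spread over
# NUM_SPLITS reduce buckets instead of overloading a single process.
NUM_SPLITS = 4  # assumed split factor for this example

def split_key(key, value, is_hot):
    """Map-side: append a split index to hot keys."""
    if is_hot(key):
        idx = hash(value) % NUM_SPLITS
        return (f"{key}#{idx}", value)  # sub-key, e.g. "the#2"
    return (key, value)

def merge_subkeys(partial_counts):
    """Final pass: strip the split index and merge partial results
    that belong to the same logical key."""
    merged = {}
    for k, v in partial_counts:
        base = k.split("#")[0]
        merged[base] = merged.get(base, 0) + v
    return merged
```

In a word-count run, for instance, the partial counts for `"the#0"` … `"the#3"` computed on different processes are summed back into a single count for `"the"`.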


Summary

Introduction

With the growth of simulation and scientific data, data analytics and data-intensive workloads have become an integral part of large-scale scientific computing. MapReduce [11] is one of the most popular programming models within the broad data analytics domain. Most implementations of MapReduce, such as Hadoop [1] and Spark [37], target Linux-based commodity clusters whose features differ significantly from supercomputers in terms of operating systems, networks, and storage. System software stacks on supercomputers, including the operating system and computational libraries, are specialized for scientific computing. Supercomputers such as the IBM Blue Gene/Q [2] use specialized lightweight operating systems that do not provide the same capabilities as a traditional operating system such as Linux or Windows. The map phase processes the input data using a user-defined map callback function and generates intermediate ⟨key, value⟩ (KV) pairs. The reduce phase processes the ⟨key, multivalue⟩ (KMV) lists with a user-defined reduce callback function and generates the final output
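The map, combine, and reduce phases described above can be sketched with a minimal word-count example. The function names are illustrative, not Mimir's C++/MPI interface; the local combiner is the step that merges KV pairs with equal keys before the shuffle, which is what shrinks intermediate memory on skewed inputs:

```python
from collections import Counter

def map_phase(chunk):
    """User-defined map callback: emit intermediate <key, value> (KV) pairs."""
    return [(word, 1) for word in chunk.split()]

def combine(kv_pairs):
    """Local combiner: merge KV pairs with equal keys on each process
    before the shuffle, reducing intermediate memory usage."""
    counts = Counter()
    for k, v in kv_pairs:
        counts[k] += v
    return list(counts.items())

def reduce_phase(kmv):
    """User-defined reduce callback over <key, multivalue> (KMV) lists."""
    return {k: sum(values) for k, values in kmv.items()}
```

On a skewed input such as `"a b a a"`, the combiner collapses three `("a", 1)` pairs into a single `("a", 3)` before any data crosses the network.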

