Abstract

Data analytics has become an integral part of large-scale scientific computing. Among various data analytics frameworks, MapReduce has gained the most traction. Although some efforts have been made to enable efficient MapReduce for supercomputing systems, they are often limited to fairly homogeneous workloads where equal partitioning of input data across tasks results in essentially equal output or temporary data generated on each task. For workloads that are more skewed, however, current implementations can suffer imbalance in memory usage and, consequently, slower execution and a loss of data scalability. To tackle this problem, we enhance a previously published memory-conscious MapReduce over MPI framework called Mimir. Our enhancements to Mimir include combiner and dynamic repartition optimizations that minimize and balance memory usage across processes, achieving close-to-optimal memory balance and reducing execution time by up to 12 times. Experimental results show that Mimir can scale to at least 3072 processes on the Tianhe-2 supercomputer on skewed datasets.

Highlights

  • With the growth of simulation and scientific data, data analytics and data-intensive workloads have become an integral part of large-scale scientific computing

  • We evaluate Mimir’s in-memory workflow, the combiner workflow, the dynamic repartition workflow, and the superkey and splitting approach with respect to memory usage and performance on the Tianhe-2 supercomputer for three benchmarks and three types of datasets: balanced data, value-imbalanced data, and key-mapping-imbalanced data

  • We present two data-driven optimizations: a dynamic repartition approach to mitigate the impact of the data skew problem and a splitting strategy to deal with datasets in which a few keys occur significantly more frequently than do the other keys
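The splitting strategy in the last highlight can be illustrated with a minimal sketch. This is not Mimir's actual C++/MPI interface; the sub-key naming scheme, `NUM_SPLITS`, and the helper names are assumptions made for this example. The idea is that a "hot" key whose values dominate one process is split into several sub-keys, so the partitioner scatters its values across multiple reduce processes, and a final pass merges the partial results:

```python
# Illustrative sketch of key splitting for skewed keys (not Mimir's API).
# A frequent key gets a split index appended so its values spread over
# NUM_SPLITS reduce buckets instead of overloading a single process.
NUM_SPLITS = 4  # assumed split factor for this example

def split_key(key, value, is_hot):
    """Map-side: append a split index to hot keys."""
    if is_hot(key):
        idx = hash(value) % NUM_SPLITS
        return (f"{key}#{idx}", value)  # sub-key, e.g. "the#2"
    return (key, value)

def merge_subkeys(partial_counts):
    """Final pass: strip the split index and merge partial results
    that belong to the same logical key."""
    merged = {}
    for k, v in partial_counts:
        base = k.split("#")[0]
        merged[base] = merged.get(base, 0) + v
    return merged
```

In a word-count run, for instance, the partial counts for `"the#0"` … `"the#3"` computed on different processes are summed back into a single count for `"the"`.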


Summary

Introduction

With the growth of simulation and scientific data, data analytics and data-intensive workloads have become an integral part of large-scale scientific computing. MapReduce [11] is one of the most popular programming models within the broad data analytics domain. Most implementations of MapReduce, such as Hadoop [1] and Spark [37], target Linux-based commodity clusters whose features differ significantly from supercomputers in terms of operating systems, networks, and storage. System software stacks on supercomputers, including the operating system and computational libraries, are specialized for scientific computing. Supercomputers such as the IBM Blue Gene/Q [2] use specialized lightweight operating systems that do not provide the same capabilities as a traditional operating system such as Linux or Windows. The map phase processes the input data using a user-defined map callback function and generates intermediate ⟨key, value⟩ (KV) pairs. The reduce phase processes the ⟨key, multivalue⟩ (KMV) lists with a user-defined reduce callback function and generates the final output
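The map, combine, and reduce phases described above can be sketched with a minimal word-count example. The function names are illustrative, not Mimir's C++/MPI interface; the local combiner is the step that merges KV pairs with equal keys before the shuffle, which is what shrinks intermediate memory on skewed inputs:

```python
from collections import Counter

def map_phase(chunk):
    """User-defined map callback: emit intermediate <key, value> (KV) pairs."""
    return [(word, 1) for word in chunk.split()]

def combine(kv_pairs):
    """Local combiner: merge KV pairs with equal keys on each process
    before the shuffle, reducing intermediate memory usage."""
    counts = Counter()
    for k, v in kv_pairs:
        counts[k] += v
    return list(counts.items())

def reduce_phase(kmv):
    """User-defined reduce callback over <key, multivalue> (KMV) lists."""
    return {k: sum(values) for k, values in kmv.items()}
```

On a skewed input such as `"a b a a"`, the combiner collapses three `("a", 1)` pairs into a single `("a", 3)` before any data crosses the network.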

