ELS: Emulation system for debugging and tuning large-scale parallel programs on small clusters

Fang Lin, Yayu Guo, Yi Liu, Depei Qian

doi:10.1007/s11227-020-03319-6

Abstract

Continuous scaling-up of high-performance computing systems has brought challenges to the debugging and tuning of large-scale parallel programs. First, to locate bugs in a program or tune its performance, programmers often need to execute the program repeatedly at a specified scale, which consumes massive resources. Second, because job scheduling systems are used extensively, programmers can only submit their programs as jobs and cannot interact with them, which restricts debugging efficiency and flexibility. To address these challenges, this paper proposes an emulation system that supports debugging and tuning of large-scale parallel programs by executing them at the desired scale on a small cluster. The program is first executed at the desired scale on the target HPC system to record the necessary information; programmers can then choose a subset of the program's processes and re-execute it repeatedly on a small cluster, during which the emulation system controls the execution of the processes and programmers can debug their programs by attaching tools to the selected processes. Moreover, the system supports the popular CPU+GPU heterogeneous architecture. The system is evaluated on a small cluster, with a 1000-node system as the target HPC system; experimental results demonstrate the accuracy and efficiency of emulation-execution.
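The record-then-re-execute idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `Recorder`, `Emulator`, `run_full_scale`, and `replay_rank` are hypothetical, and the ring-style message pattern is an assumption chosen only to keep the example small. The sketch logs the messages each process receives during a full-scale run, then replays them to a chosen subset of processes without running the others.

```python
from collections import defaultdict, deque


class Recorder:
    """Full-scale run: log every message that each rank receives."""

    def __init__(self):
        # destination rank -> list of (source rank, payload), in arrival order
        self.log = defaultdict(list)

    def send(self, src, dest, payload):
        self.log[dest].append((src, payload))


class Emulator:
    """Small-cluster run: feed recorded messages to the selected ranks only."""

    def __init__(self, log, selected):
        self.selected = set(selected)
        self.queues = {r: deque(log.get(r, [])) for r in self.selected}

    def recv(self, rank):
        if rank not in self.selected:
            raise KeyError(f"rank {rank} was not selected for re-execution")
        # Return the next recorded (source, payload) pair for this rank.
        return self.queues[rank].popleft()


def run_full_scale(n_ranks):
    """Hypothetical workload: each rank sends its squared id to its ring neighbour."""
    rec = Recorder()
    for src in range(n_ranks):
        rec.send(src, (src + 1) % n_ranks, src * src)
    return rec.log


def replay_rank(emu, rank):
    """Re-execute one rank's receive-and-compute step from the recorded log."""
    _src, payload = emu.recv(rank)
    return payload + rank


if __name__ == "__main__":
    log = run_full_scale(1000)               # recorded once at the desired scale
    emu = Emulator(log, selected=[0, 500])   # only two ranks are re-executed
    print(replay_rank(emu, 500))
    print(replay_rank(emu, 0))
```

In this sketch the re-executed ranks see the same inputs as in the recorded run, which is what would let a debugger be attached to them repeatedly; a real system would also have to handle timing, collective operations, and nondeterminism, none of which is modelled here.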

Journal: The Journal of Supercomputing
Publication Date: May 23, 2020
Citations: 2

Similar Papers

Re-Running Large-Scale Parallel Programs Using Two Nodes
Yayu Guo ... Yi Liu
01 Dec 2018

Practical simulation of large-scale parallel programs and its performance analysis of the NAS Parallel Benchmarks
Kazuto Kubota ... Mitsuhisa Sato
01 Jan 1998

Testing Path Generation Algorithm with Network Performance Constraints for Nondeterministic Parallel Programs
Wei Wang ... Lejun Zhang
01 Jun 2006

Error detection in large-scale parallel programs with long runtimes
Dieter Kranzlmüller ... Jens Volkert
Future Generation Computer Systems | VOL. 19
12 Apr 2003
