Abstract

Exascale computing systems will exhibit high degrees of hierarchical parallelism, with thousands of computing nodes and hundreds of cores per node. Efficiently exploiting hierarchical parallelism is challenging due to load imbalance that arises at multiple levels. OpenMP is the most widely used standard for expressing and exploiting the ever-increasing node-level parallelism. However, the scheduling options in OpenMP are insufficient to address the load imbalance that arises during the execution of multithreaded applications. These limited options also hinder research on novel scheduling techniques, which require comparison with existing techniques from the literature. This work introduces LB4OMP, an open-source dynamic load balancing library that implements successful scheduling algorithms from the literature. LB4OMP is a research infrastructure designed to spur and support present and future scheduling research and to improve the performance of multithreaded applications. Through an extensive performance analysis campaign, we assess the effectiveness and demystify the performance of all loop scheduling techniques in the library. We show that, for numerous application-system pairs, the scheduling techniques in LB4OMP outperform the scheduling options in OpenMP. Node-level load balancing using LB4OMP reduces cross-node load imbalance and improves the performance of MPI+OpenMP applications, which is critical for Exascale computing.
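To make the load-imbalance argument concrete, the toy model below compares a block-static schedule (contiguous blocks of roughly equal iteration counts, similar in spirit to OpenMP's schedule(static) without a chunk size) with fixed-size self-scheduling (similar in spirit to schedule(dynamic, chunk)) on a loop with uneven iteration costs. It is a simplified sketch, not LB4OMP code: the iteration costs are hypothetical, scheduling overhead is ignored, and the function names are illustrative only.

```python
import heapq
from typing import List


def static_makespan(costs: List[float], threads: int) -> float:
    """Block-static: each thread gets one contiguous block of iterations."""
    n = len(costs)
    finish_times = []
    for t in range(threads):
        lo = t * n // threads
        hi = (t + 1) * n // threads
        finish_times.append(sum(costs[lo:hi]))
    return max(finish_times)


def self_scheduling_makespan(costs: List[float], threads: int, chunk: int) -> float:
    """Fixed-size self-scheduling: the next chunk goes to the first idle thread.

    Assumption: dispatching a chunk costs nothing (no scheduling overhead).
    """
    # Heap of (time at which the thread becomes idle, thread id).
    idle = [(0.0, t) for t in range(threads)]
    heapq.heapify(idle)
    for start in range(0, len(costs), chunk):
        free_at, t = heapq.heappop(idle)
        heapq.heappush(idle, (free_at + sum(costs[start:start + chunk]), t))
    return max(free_at for free_at, _ in idle)


# Hypothetical loop: the four heavy iterations come first.
costs = [4.0] * 4 + [1.0] * 4
print(static_makespan(costs, threads=2))                      # 16.0
print(self_scheduling_makespan(costs, threads=2, chunk=1))    # 10.0
```

With the heavy iterations concentrated at the start, the static split leaves one thread with 16 units of work while the other finishes after 4, whereas self-scheduling with single-iteration chunks reaches the ideal 10 (20 units over 2 threads). Real runtimes pay a cost per dispatched chunk, so smaller chunks are not free; balancing that overhead against load imbalance is what distinguishes scheduling techniques.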

Highlights

  • On the road to Exascale, we observe that modern and future high performance computing (HPC) systems combine an increasing number of computing nodes and, in particular, cores per node

  • This work is significant because it bridges the gap between the state-of-the-art and the state-of-the-practice of load balancing in multithreaded applications

  • Two microbenchmarks and three computing node types are used to evaluate the performance of the dynamic loop self-scheduling (DLS) techniques already available in the LLVM OpenMP runtime library (RTL) and of those newly implemented in LB4OMP


Summary

INTRODUCTION

On the road to Exascale, we observe that modern and future high performance computing (HPC) systems combine an increasing number of computing nodes and, in particular, cores per node. Applications using LB4OMP benefit from improved performance due to its portfolio of DLS techniques, some of which adapt during execution to unpredictable variations in application and system characteristics (see Section 3.1). The 14 DLS techniques in LB4OMP were selected to cover a broad spectrum of dynamic (and adaptive) scheduling techniques. The novelty of this work lies in providing a standalone and unified implementation of efficient scheduling techniques from the literature, which is needed to spur new research in scheduling and load balancing for Exascale systems. This work is significant because it bridges the gap between the state-of-the-art and the state-of-the-practice of load balancing in multithreaded applications. This will allow the large degrees of heterogeneous node-level parallelism in today's pre-Exascale and upcoming Exascale systems to be exploited efficiently to improve application performance.
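DLS techniques differ mainly in how chunk sizes progress over the loop. As an illustration, the sketch below computes the chunk sequences of two classic non-adaptive techniques from the scheduling literature: guided self-scheduling (GSS), where each request takes ceil(R/P) of the R remaining iterations, and factoring with factor 2 (FAC2), where iterations are handed out in batches of P equal chunks of size ceil(R/(2P)). These are textbook formulations used here for illustration; they are not taken from LB4OMP's source, and the function names are hypothetical.

```python
from typing import List


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive integers (avoids float rounding)."""
    return (a + b - 1) // b


def gss_chunks(n: int, p: int) -> List[int]:
    """Guided self-scheduling: each chunk is ceil(R / P), R = remaining iterations."""
    chunks, remaining = [], n
    while remaining > 0:
        c = ceil_div(remaining, p)
        chunks.append(c)
        remaining -= c
    return chunks


def fac2_chunks(n: int, p: int) -> List[int]:
    """Factoring (factor 2): batches of P chunks, each of size ceil(R / (2P)),
    with R counted at the start of the batch."""
    chunks, remaining = [], n
    while remaining > 0:
        c = ceil_div(remaining, 2 * p)
        for _ in range(p):
            if remaining == 0:
                break
            take = min(c, remaining)  # last batch may not fill all P chunks
            chunks.append(take)
            remaining -= take
    return chunks


print(gss_chunks(100, 4))
# [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]
print(fac2_chunks(100, 4))
# [13, 13, 13, 13, 6, 6, 6, 6, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1]
```

Both techniques start with large chunks to keep scheduling overhead low and shrink them toward the end of the loop so that late-finishing threads can be balanced. FAC2 decreases chunk sizes more conservatively than GSS, handing out smaller first chunks at the price of more scheduling requests. Adaptive techniques go further by adjusting chunk sizes from measurements taken during execution.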

RELATED WORK
THE LB4OMP LIBRARY
Dynamic Loop Scheduling Techniques
Features for Performance Measurement
Load Balancing Applications with LB4OMP
PERFORMANCE RESULTS AND DISCUSSION
Design of Factorial Experiments
Performance Analysis
Impact of Chunk Parameter Choice
Influence of Chunk Size Progression
CONCLUSIONS AND FUTURE WORK