Abstract

Nested (fork-join) parallelism eases parallel programming by enabling high-level expression of parallelism and leaving the mapping between parallel tasks and hardware to the runtime scheduler. A key challenge in dynamically scheduling nested parallelism is exploiting data locality, which has become more demanding with the deep cache hierarchies of modern many-core processors. This paper introduces almost deterministic work stealing (ADWS), which efficiently exploits data locality by deterministically planning a cache-hierarchy-aware schedule while allowing a small amount of scheduling flexibility to facilitate dynamic load balancing. Furthermore, we propose an extension of our prior work on ADWS that achieves better shared-cache utilization; we call the improved scheduler multi-level ADWS. The idea is to recursively apply ADWS within a shared cache only to those parts of the computation whose working sets are small enough to fit into that cache, thereby avoiding excessive capacity misses. Our evaluation on a parallel decision tree construction benchmark demonstrated that multi-level ADWS outperformed the conventional random work stealing of Cilk Plus by 61% and showed a 40% performance improvement over the previous ADWS design.
