Abstract
Next-generation HPC platforms are likely to be characterized by significant, unpredictable nonuniformities in execution time among compute nodes and cores. The resulting load imbalances are expected to arise from a variety of sources, including manufacturing discrepancies, dynamic power management, runtime component failure, OS jitter, software-mediated resiliency, and TLB/cache performance variations. It is well understood that existing algorithms with frequent points of bulk synchronization will perform relatively poorly in the presence of these sources of process nonuniformity. Thus, recasting classic bulk synchronous algorithms in a more asynchronous, coarse-grained parallel form is a critical area of research for next-generation computing. We propose a class of parallel algorithms for explicit stencil computations that can tolerate these nonuniformities by decoupling per-process communication and computation, so that each process can progress asynchronously while maintaining solution correctness. These algorithms are benchmarked with a 1D domain-decomposed (“slabbed”) implementation of the 2D heat equation as a model problem, and are tested in the presence of simulated nonuniform process execution rates. The resulting performance is compared to a classic bulk synchronous implementation of the model problem. Results show that this article’s algorithm, run on a machine with simulated process nonuniformities, is 5% to 99% slower than its classic counterpart run on a machine free of nonuniformities. However, when both algorithms are run on a machine with comparable synthetic process nonuniformities, this article’s algorithm is 1 to 37 times faster than its classic counterpart.
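For readers unfamiliar with the model problem, the following is a minimal sketch of the classic bulk synchronous baseline only, not of the asynchronous algorithm proposed here. It assumes a standard explicit five-point (FTCS) update for the 2D heat equation with fixed boundary values, and it emulates the 1D slab decomposition serially in plain Python rather than with a message-passing library. All function names and the parameter `alpha` (the diffusion number, assumed to be at most 0.25 for stability) are hypothetical and chosen for illustration.

```python
def step(u, alpha):
    """One explicit five-point update; first/last rows and columns are held fixed."""
    n, m = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = u[i][j] + alpha * (
                u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                - 4.0 * u[i][j]
            )
    return new


def make_grid(n, m):
    """Cold n-by-m plate with a single hot cell in the middle (illustrative data)."""
    u = [[0.0] * m for _ in range(n)]
    u[n // 2][m // 2] = 100.0
    return u


def split_into_slabs(u, nslabs):
    """Divide interior rows among slabs; each slab keeps one ghost row above and below."""
    base, extra = divmod(len(u) - 2, nslabs)
    slabs, start = [], 1
    for k in range(nslabs):
        end = start + base + (1 if k < extra else 0)
        slabs.append([row[:] for row in u[start - 1:end + 1]])
        start = end
    return slabs


def exchange_halos(slabs):
    """Copy each slab's edge rows into its neighbours' ghost rows."""
    for upper, lower in zip(slabs, slabs[1:]):
        upper[-1] = lower[1][:]   # lower's first owned row -> upper's bottom ghost
        lower[0] = upper[-2][:]   # upper's last owned row -> lower's top ghost


def gather(slabs):
    """Reassemble the global grid from the rows each slab owns."""
    out = [slabs[0][0][:]]
    for s in slabs:
        out.extend(row[:] for row in s[1:-1])
    out.append(slabs[-1][-1][:])
    return out


def run_bulk_synchronous(u, nslabs, nsteps, alpha):
    """Every slab computes one step, then all slabs exchange halos before continuing."""
    slabs = split_into_slabs(u, nslabs)
    for _ in range(nsteps):
        slabs = [step(s, alpha) for s in slabs]  # compute phase
        exchange_halos(slabs)                    # global synchronization point
    return gather(slabs)
```

In this sketch the halo exchange after every step plays the role of the bulk synchronization point: no slab can begin step k+1 until all of its neighbours have finished step k, which is why a single slow process would stall the others in a real parallel run.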