A Hierarchical Data-Partitioning Algorithm for Performance Optimization of Data-Parallel Applications on Heterogeneous Multi-Accelerator NUMA Nodes

Hamidreza Khaleghzadeh, Alexey Lastovetsky, Ravi Reddy Manumachu

doi:10.1109/access.2019.2959905

Open Access

Journal: IEEE Access	Publication Date: Dec 26, 2019
Citations: 39	License type: CC BY 4.0

Affiliation: University College Dublin

Abstract

Modern HPC platforms are highly heterogeneous with tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays), empowering them to address the twin critical concerns of performance and energy efficiency. Due to this inherent characteristic, processing elements contend for shared on-chip resources such as the Last Level Cache (LLC) and interconnect, and for shared nodal resources such as DRAM and PCI-E links. This results in complexities such as resource contention, non-uniform memory access (NUMA), and accelerator-specific limitations such as limited main memory, which necessitates support for efficient out-of-card execution. Due to these complexities, the performance profiles of data-parallel applications executing on these platforms are not smooth and deviate significantly from the shapes that allowed state-of-the-art load-balancing algorithms to find optimal solutions. In this paper, we propose a hierarchical two-level data partitioning algorithm minimizing the parallel execution time of data-parallel applications on clusters of $h$ identical nodes where each node has $c$ heterogeneous processors. This algorithm takes as input $c$ discrete speed functions of cardinality $m$ corresponding to the $c$ heterogeneous processors. It does not make any assumptions about the shapes of these functions. Unlike load-balancing algorithms, optimal solutions found by the algorithm may not load-balance an application in terms of execution time. The proposed algorithm has a low time complexity of $O(m^{2} \times h + m^{3} \times c^{3})$, unlike the state-of-the-art algorithm solving the same problem, which has a complexity of $O(m^{3} \times c^{3} \times h^{3})$. We also propose an extension of the algorithm for clusters of $h$ non-identical nodes where each node has $c$ heterogeneous processors.
We experimentally demonstrate the optimality of our algorithm using two well-known and highly optimized multi-threaded data-parallel applications, matrix-matrix multiplication and 2D fast Fourier transform, on a heterogeneous multi-accelerator NUMA node containing an Intel multicore Haswell CPU, an Nvidia K40c GPU, and an Intel Xeon Phi co-processor, and on a simulated homogeneous cluster of such nodes.
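To illustrate why an optimal partition need not be load-balanced when performance profiles are not smooth, the following is a minimal sketch. It is not the hierarchical algorithm proposed in the paper; it is a plain dynamic program over discrete execution-time tables, and the function name `partition_min_makespan`, the table layout, and the sample timings are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def partition_min_makespan(
    times: Sequence[Sequence[float]], n: int
) -> Tuple[float, List[int]]:
    """Distribute n work units over processors to minimize the parallel time.

    times[i][x] is the (assumed, measured) time processor i needs for x units,
    with times[i][0] == 0. No shape is assumed for these discrete functions.
    Returns (minimal parallel execution time, units assigned per processor).
    """
    inf = float("inf")
    # best[w]: minimal parallel time for w units on the processors seen so far.
    best = [0.0] + [inf] * n
    choices: List[List[int]] = []
    for t in times:
        new_best = [inf] * (n + 1)
        pick = [0] * (n + 1)
        for w in range(n + 1):
            # Try giving x units to the current processor, w - x to earlier ones.
            for x in range(min(w, len(t) - 1) + 1):
                cand = max(best[w - x], t[x])
                if cand < new_best[w]:
                    new_best[w] = cand
                    pick[w] = x
        best = new_best
        choices.append(pick)
    if best[n] == inf:
        raise ValueError("workload exceeds the range of the time tables")
    # Backtrack from the last processor to recover the distribution.
    parts: List[int] = []
    w = n
    for pick in reversed(choices):
        parts.append(pick[w])
        w -= pick[w]
    parts.reverse()
    return best[n], parts


# Hypothetical, non-smooth time tables for two processors (index = work units).
sample_times = [
    [0.0, 1.0, 4.0, 3.0, 6.0],  # processor 0: 3 units run faster than 2
    [0.0, 2.0, 2.5, 5.0, 7.0],  # processor 1
]
makespan, distribution = partition_min_makespan(sample_times, 4)
print(makespan, distribution)  # prints 3.0 [3, 1]
```

In this made-up instance the best distribution gives the two processors execution times of 3.0 and 2.0, so the optimum is not balanced; the balanced-looking split of two units each would take 4.0.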

Highlights

Modern HPC platforms have become highly heterogeneous owing to the tight integration of multicore CPUs and accelerators (such as Graphics Processing Units (GPUs), Intel Xeon Phis, or Field-Programmable Gate Arrays), empowering them to address the twin critical concerns of performance and energy efficiency.
We experimentally study the performance of HiPOPTA compared with HPOPTA and load-balancing algorithms using two well-known multi-threaded data-parallel applications, matrix-matrix multiplication (DGEMM) and 2D fast Fourier transform (2D-FFT).
The HCLServer node consists of one Intel Haswell CPU, one Nvidia K40c GPU, and one Intel Xeon Phi 3120P.

Summary

Introduction

Modern HPC platforms have become highly heterogeneous owing to the tight integration of multicore CPUs and accelerators (such as Graphics Processing Units (GPUs), Intel Xeon Phis, or Field-Programmable Gate Arrays), empowering them to address the twin critical concerns of performance and energy efficiency. The Top500 list [1] contains about 138 systems with accelerators; NVIDIA’s Tesla P100 and Tesla V100 account for about 100 systems. All the top 25 computers in the Green500 list [2] contain accelerators. Optimizing data-parallel applications for performance on such platforms faces several challenges due to the inherent complexities introduced by the tight integration of the compute devices.

Results

Discussion

Conclusion