Abstract

While there has been growing interest in supporting accelerators, especially GPU accelerators, in large-scale systems, the user typically has to work with low-level GPU programming models such as CUDA along with the low-level Message Passing Interface (MPI).

We believe higher-level programming models such as Partitioned Global Address Space (PGAS) programming models enable productive parallel programming at both the intra-node and inter-node levels in homogeneous and heterogeneous nodes. However, GPU programming with PGAS languages is still limited in practice, because a large performance gap remains between compiler-generated GPU code and hand-tuned GPU code; hand-optimization of CPU-GPU data transfers is also an important contributor to this gap. Thus, it is not uncommon for the user to end up writing a fully external GPU program that includes the host part (i.e., GPU memory (de)allocation and host-to-device/device-to-host data transfers) and the device part (i.e., GPU kernels), and calling it from their primary language, which is not very productive.

Our key observation is that the complexity of writing the external GPU program comes not only from writing GPU kernels in the device part, but also from writing the host part. In particular, interfacing objects in the primary language to raw C/C++ pointers is tedious and error-prone, especially because high-level languages usually have a well-defined type system with type inference.

In this paper, we introduce the GPUAPI module, which offers multiple abstraction levels of low-level GPU API routines for high-level programming models, with a special focus on PGAS languages. This allows the user to choose an appropriate abstraction level depending on their tuning scenario.
The module is also designed to work with multiple standard low-level GPU programming models: CUDA, HIP, DPC++, and SYCL, thereby significantly improving productivity and portability.

We use Chapel as the primary example, and our preliminary performance and productivity evaluations show that the use of the GPUAPI module significantly simplifies GPU programming in a high-level programming model like Chapel, while targeting different multi-node CPU+GPU platforms with no performance loss.

Keywords: GPUs, Chapel, PGAS languages, Distributed programming model, GPU API library
