Abstract

While GPUs are increasingly popular for high-performance computing, optimizing the performance of GPU programs is generally a time-consuming and non-trivial process. This complexity stems from the low abstraction level of standard GPU programming models such as CUDA and OpenCL: programmers must orchestrate low-level operations to exploit the full capability of GPUs. In terms of software productivity and portability, a more attractive approach is to facilitate GPU programming by providing high-level abstractions for expressing parallel algorithms. OpenMP is a directive-based shared-memory parallel programming model that has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP's high-level parallel abstractions with accelerator programming. This extension allows programmers to write GPU programs in standard C/C++ or Fortran without exposing many details of the GPU architecture. However, such high-level parallel programming strategies shift additional program optimization onto compilers, which can result in lower performance than fully hand-tuned code written in low-level programming models. To study potential performance improvements from compiling and optimizing high-level GPU programs, in this paper we 1) evaluate a set of OpenMP 4.x benchmarks on an IBM POWER8 and NVIDIA Tesla GPU platform and 2) conduct a comparative performance analysis of hand-written CUDA programs and GPU code automatically generated by the IBM XL and clang/LLVM compilers.
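To illustrate the programming style the abstract refers to, the following is a minimal sketch (not taken from the paper's benchmarks) of OpenMP 4.x accelerator offloading: a vector addition written in standard C, where data movement and GPU parallelism are expressed only through directives and the compiler (for example IBM XL or clang/LLVM) generates the device kernel. The array size and function name are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)  /* illustrative problem size */

void vecadd(const float *a, const float *b, float *c, int n)
{
    /* The map() clauses describe host/device data transfers; the
       target/teams/distribute/parallel for combination asks the
       compiler to offload and parallelize the loop on the GPU. */
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

int main(void)
{
    float *a = malloc(N * sizeof(float));
    float *b = malloc(N * sizeof(float));
    float *c = malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    vecadd(a, b, c, N);

    printf("c[42] = %f\n", c[42]);
    free(a); free(b); free(c);
    return 0;
}

An equivalent hand-written CUDA version would require an explicit kernel, device memory allocation, and host-device copies; the comparison of such hand-tuned code against compiler-generated offload code is the subject of the paper's evaluation.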
