Abstract
Modern graphics processing units (GPUs) have become powerful and cost-effective computing platforms. Parallel programming standards (e.g., CUDA) and directive-based programming standards (e.g., OpenHMPP and OpenACC) are available to harness this tremendous computing power for large-scale modelling and simulation in scientific areas. ANUGA is a tsunami modelling application based on unstructured triangular meshes and implemented in Python/C. This paper explores issues in porting and optimizing a Python/C-based unstructured mesh application to GPUs. Two paradigms are compared: CUDA via the PyCUDA API, which involves writing GPU kernels, and OpenHMPP, which involves adding directives to C code. In both cases, the ‘naive’ approach of transferring unstructured mesh data to the GPU for each kernel resulted in an actual slowdown relative to single-core CPU performance. Profiling results confirmed that this is due to the time spent transferring data between the host and the device, even though all individual kernels achieved a good speedup. This necessitated an advanced approach, in which all key data structures are mirrored on the host and the device. For both paradigms, this in turn involved converting all code that updates these data structures to CUDA (or directive-augmented C, in the case of OpenHMPP). Furthermore, in the case of CUDA, the porting can no longer be done incrementally: all changes must be made in a single step. For debugging, this makes it very difficult to identify which kernel(s) introduced bugs. To alleviate this, we adapted the relative debugging technique to the host-device context. Here, when in debugging mode, the mirrored data structures are updated at each step on both the host (using the original serial code) and the device, so that any discrepancy is detected immediately. We present a generic Python-based implementation of this technique. With this approach, the CUDA version achieved a 2x speedup, and the OpenHMPP version 1.6x. The main optimization, rearranging the unstructured mesh to achieve coalesced memory access patterns, contributed 10% of the former. In terms of productivity, however, OpenHMPP achieved significantly better speedup per hour of programming effort.
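The host-device relative debugging scheme described above can be pictured with a short Python sketch. The following is a minimal illustration only, assuming PyCUDA and NumPy; the function and parameter names (relative_debug_step, gpu_kernel, cpu_reference) are hypothetical and do not reflect the paper's actual interface.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- initialises a CUDA context on import
import pycuda.gpuarray as gpuarray


def relative_debug_step(name, gpu_kernel, cpu_reference, host_array,
                        rtol=1e-6, atol=1e-9):
    """Run one update on both host and device and compare the results.

    host_array    -- NumPy array holding the authoritative host copy
    gpu_kernel    -- callable updating the mirrored device copy in place
    cpu_reference -- original serial update, applied to the host copy
    """
    # Mirror the host data on the device (in the real scheme the device
    # copy is persistent; it is transferred here only to keep the sketch
    # self-contained).
    device_array = gpuarray.to_gpu(host_array)

    cpu_reference(host_array)   # trusted serial update on the host
    gpu_kernel(device_array)    # ported update on the device

    # Flag the first step at which the ported kernel diverges.
    if not np.allclose(device_array.get(), host_array, rtol=rtol, atol=atol):
        raise RuntimeError(f"kernel '{name}' diverged from the serial code")
```

In debugging mode, each ported kernel would be wrapped this way so that the first step at which it diverges from the serial code is flagged immediately; in production mode the host update and the comparison are skipped and only the device copy is advanced.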