Abstract

Modern graphics processing units (GPUs) have become powerful and cost-effective computing platforms. Parallel programming standards (e.g. CUDA) and directive-based programming standards (such as OpenHMPP and OpenACC) are available to harness this tremendous computing power for large-scale modelling and simulation in scientific areas. ANUGA is a tsunami modelling application based on unstructured triangular meshes and implemented in Python/C. This paper explores the issues in porting and optimizing a Python/C-based unstructured mesh application for GPUs. Two paradigms are compared: CUDA via the PyCUDA API, which involves writing GPU kernels, and OpenHMPP, which involves adding directives to C code. In either case, the ‘naive’ approach of transferring unstructured mesh data to the GPU for each kernel resulted in an actual slowdown relative to single-core performance on a CPU. Profiling confirmed that this is due to data transfer times between the host and the device, even though every individual kernel achieved a good speedup. This necessitated an advanced approach, in which all key data structures are mirrored on the host and the device. For both paradigms, this in turn required converting all code that updates these data structures to CUDA (or, in the case of OpenHMPP, to directive-augmented C). Furthermore, in the case of CUDA, the porting can no longer be done incrementally: all changes must be made in a single step, which makes identifying which kernel(s) introduced bugs very difficult. To alleviate this, we adapted the relative debugging technique to the host-device context: in debugging mode, the mirrored data structures are updated at each step on both the host (using the original serial code) and the device, so that any discrepancy is detected immediately. We present a generic Python-based implementation of this technique. With this approach, the CUDA version achieved a 2x speedup and the OpenHMPP version 1.6x; the main optimization, rearranging the unstructured mesh to achieve coalesced memory access patterns, contributed 10% of the former. In terms of productivity, however, OpenHMPP achieved significantly better speedup per hour of programming effort.
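To make the relative debugging idea concrete, the following is a minimal Python sketch of the host-device comparison step described above. It assumes the device state can be copied back as a NumPy array (e.g. via PyCUDA's gpuarray.get()); the class and method names, tolerances, and the commented usage lines are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

class RelativeDebugger:
    """Sketch of host/device relative debugging: after each update,
    compare the mirrored device copy of a data structure against the
    reference result computed by the original serial code."""

    def __init__(self, rtol=1e-6, atol=1e-9):
        self.rtol, self.atol = rtol, atol

    def check_step(self, name, host_array, device_array):
        # host_array: result of the original serial (reference) path.
        # device_array: mirrored structure copied back from the GPU.
        close = np.isclose(host_array, device_array,
                           rtol=self.rtol, atol=self.atol)
        if not close.all():
            bad = np.flatnonzero(~close)
            # Fail fast: the discrepancy points at the kernel that
            # ran in this step, rather than surfacing much later.
            raise AssertionError(
                f"{name}: host/device mismatch at {bad.size} entries, "
                f"first index {bad[0]}")

# Hypothetical usage after one evolve step (names are placeholders):
dbg = RelativeDebugger()
# host_stage = serial_update(...)     # original Python/C path
# dev_stage  = stage_gpu.get()        # mirrored GPU array (PyCUDA)
# dbg.check_step("stage_update", host_stage, dev_stage)
```

Because the check runs immediately after each mirrored update, a failing assertion identifies the offending kernel directly, which is what the non-incremental CUDA port otherwise makes difficult.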
