Efficient utilization of launched threads on GPUs: The spherical harmonic transform as a case study

Feng-Shun Lu, Jun-Qiang Song, Wang-Qun Lin, Yu-Fei Pang, Kai-Jun Ren, Pei-Chang Shi

doi:10.1016/j.cpc.2013.06.019

Abstract

Maximum utilization of hardware resources is crucial to leverage the enormous computational power of graphics processing units (GPUs). However, an effective metric to indicate whether the launched threads are kept busy is lacking. To address this issue, we propose a metric called ETU that describes the efficiency of thread utilization. First, we execute several CUDA SDK sample codes, with and without double-precision arithmetic, on two generations of GPUs to perform a preliminary validation of the ETU metric. Taking the spherical harmonic transform as an example, we then give two GPU implementations of Legendre transforms and examine the relationship between ETU and application performance. Experimental results show that applications with larger ETU usually achieve better performance, and that ETU is a more accurate indicator of performance than the occupancy metric proposed by NVIDIA. Finally, we select the better-performing GPU implementations to accelerate the Legendre transforms in STSWM, a spectral transform shallow water model.

Full Text