Demystifying the 16 × 16 thread‐block for stencils on the GPU

Siham Tabik, Maurice Peemen, Henk Corporaal, Nicolas Guil

doi:10.1002/cpe.3591

Abstract

Stencil computation is of paramount importance in many fields, including image processing, structural biology and biomedicine. There is a constant demand for maximizing the performance of stencils on state‐of‐the‐art architectures, such as graphics processing units (GPUs). One of the important issues when optimizing these kernels for the GPU is the selection of the thread‐block configuration that maximizes the overall performance. Usually, programmers look for the optimal thread‐block configuration in a reduced space of square thread‐block configurations or simply use the best configuration reported in previous works, which is usually 16 × 16. This paper provides a better understanding of the impact of thread‐block configurations on the performance of stencils on the GPU. In particular, we model locality and parallelism and consider that the optimal configurations are within the space that provides: (1) a small number of global memory communications; (2) good shared memory utilization with a small number of conflicts; (3) good utilization of the streaming multiprocessors; and (4) high efficiency of the threads within a thread‐block. The model determines the set of optimal thread‐block configurations without the need to execute the code. We validate the proposed model using six stencils with different halo widths and show that it reduces the optimization space to around 25% of the total valid space. The configurations in this space achieve a throughput of at least 75% of that of the best configuration, and the space is guaranteed to include the best configurations. Copyright © 2015 John Wiley & Sons, Ltd.
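To make the notion of a thread‐block search space concrete, the following is a minimal sketch and not the model proposed in the paper. It enumerates power‐of‐two thread‐block shapes and scores them on a single, simplified proxy for criterion (1): the ratio of elements loaded (tile plus halo) to elements computed. The names `candidate_blocks`, `halo_overhead` and `rank_blocks`, the 1024‐thread limit, and the restriction to power‐of‐two shapes are all assumptions made for illustration.

```python
from itertools import product

WARP_SIZE = 32                 # threads per warp on NVIDIA GPUs
MAX_THREADS_PER_BLOCK = 1024   # assumed limit; depends on the architecture


def candidate_blocks(max_threads=MAX_THREADS_PER_BLOCK, warp=WARP_SIZE):
    """Enumerate power-of-two (bx, by) shapes that fill a whole number of warps."""
    sizes = [2 ** k for k in range(0, 11)]  # 1, 2, 4, ..., 1024
    return [
        (bx, by)
        for bx, by in product(sizes, sizes)
        if bx * by <= max_threads and (bx * by) % warp == 0
    ]


def halo_overhead(bx, by, halo):
    """Elements loaded per element computed for a bx-by-by tile (illustrative proxy)."""
    loaded = (bx + 2 * halo) * (by + 2 * halo)  # tile plus halo on every side
    return loaded / (bx * by)


def rank_blocks(halo):
    """Sort candidate shapes by increasing halo overhead (lower is better)."""
    return sorted(candidate_blocks(), key=lambda b: halo_overhead(b[0], b[1], halo))
```

Under this proxy alone, large square blocks always rank first, which is one reason a single criterion is not enough: shared memory conflicts, multiprocessor utilization and per‐thread efficiency, the other criteria listed in the abstract, are not captured here.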
