Loop Optimization for Divergence Reduction on GPUs with SIMT Architecture

Roman Novak

doi:10.1109/tpds.2014.2324587

Abstract

The single-instruction multiple thread (SIMT) architecture that can be found in some latest graphical processing units (GPUs) builds on the conventional single-instruction multiple data (SIMD) parallelism while adopting the thread programming model. The architecture suffers from a degraded performance caused by the inefficient divergence handling, a problem hidden by the programmer's view of independent threads. A loop optimization technique having the potential to increase efficiency of the core SIMD block while processing embedded divergences is investigated here. Concurrent loops are generally not bound to iterate in lock-step, allowing better alignment of thread flows via iteration scheduling. The concept efficiency is analyzed for fixed and flow-adapting scheduling policies. The proposed payoff model captures loop overhead implications, allowing one to assess the tradeoffs of applying the technique to a specific loop instance. Processing speedups can generally be observed in the total running time if kernels are compute-bound, as demonstrated by several examples. The studied iteration scheduling policies do not impose alterations to the core SIMD concept and design, thus preserving the benefits of data level parallelism.

Full Text