Abstract

The CUDA and OpenCL programming models have facilitated the widespread adoption of general-purpose GPU programming for data-parallel applications. GPUs accelerate these applications by assigning groups of threads to SIMD units, which execute the same instruction for all threads in a group. Individual threads within a group may diverge and follow different paths of execution. Divergent branches degrade performance by under-utilizing the execution pipeline, making them a major performance bottleneck. Unstructured control flow compounds this degradation in the presence of divergent branches, since it causes instructions to be executed repeatedly. In this paper, we propose a transformation that converts unstructured control flow to structured control flow. It creates only tail-controlled loops and properly nests all control-flow splits and joins by inserting predicates. We implement the transformation as an additional pass in NVIDIA's CUDA compiler and evaluate it experimentally on synthetic unstructured control flow graphs as well as on kernels from the Rodinia benchmark suite. Our approach effectively eliminates redundant execution and can improve execution time for the synthetic unstructured control flow graphs. For the benchmark kernels, it adds only a minor average overhead of 2.1% to the execution time of already structured kernels, while reducing the execution time of the single unstructured kernel by a factor of five. The representational overhead at compile time is linear in the number of instructions.
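
To illustrate the kind of divergence and predication the abstract refers to, the following minimal CUDA sketch (not taken from the paper; kernel and variable names are illustrative assumptions) shows a divergent branch and a simple predicated rewrite of the same computation, in the spirit of replacing branching with per-thread predicates.

```cuda
// Minimal sketch, not the paper's implementation. Kernel names and the
// example computation are hypothetical; they only illustrate divergence
// versus predication.
#include <cstdio>
#include <cuda_runtime.h>

// Divergent version: threads in the same warp may disagree on the condition,
// so the SIMD unit executes both paths while masking off inactive threads.
__global__ void divergent_kernel(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] % 2 == 0) {
        out[i] = in[i] * 2;
    } else {
        out[i] = in[i] + 1;
    }
}

// Predicated version: both paths are computed unconditionally and the result
// is selected per thread, keeping the control flow structured.
__global__ void predicated_kernel(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    bool p = (in[i] % 2 == 0);          // predicate for this thread
    int even_path = in[i] * 2;          // both candidate results computed
    int odd_path  = in[i] + 1;
    out[i] = p ? even_path : odd_path;  // per-thread selection
}

int main() {
    const int n = 256;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    predicated_kernel<<<(n + 127) / 128, 128>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("out[3] = %d\n", h_out[3]);  // expected: 3 + 1 = 4
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

This example only shows simple if-then-else predication; the paper's transformation additionally handles unstructured control flow graphs and restricts loops to tail-controlled form.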
