Abstract

As throughput-oriented accelerators, GPUs provide tremendous processing power by running a massive number of threads in parallel. However, exploiting high degrees of thread-level parallelism (TLP) does not always translate into the peak performance that GPUs can offer, often leaving the GPU's resources under-utilized. Compared to compute resources, memory resources can tolerate considerably lower levels of TLP due to hardware bottlenecks. Unfortunately, this tolerance is not effectively exploited by the Single Instruction Multiple Thread (SIMT) execution model employed by current GPU compute frameworks. Under the SIMT execution model, GPU applications tend to send bursts of memory requests that compete for GPU memory resources. Traditionally, hardware units such as the wavefront scheduler are used to manage such requests. However, the scheduler struggles when the number of computational operations is too low to effectively hide the long latency of memory operations. In this paper, we propose a Twin Kernel Multiple Thread (TKMT) execution model, a compiler-centric solution that improves hardware scheduling at compile time. Through static instruction scheduling, TKMT gives some wavefronts a different instruction order, better distributing their bursts of memory requests over time. Our results show that TKMT can offer a 12% average improvement over the baseline SIMT implementation on a variety of benchmarks on AMD Radeon systems.
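To make the core idea concrete, the sketch below is a source-level approximation of the twin-schedule concept, not the paper's actual mechanism: TKMT is applied by the compiler to the kernel binary on AMD GPUs, whereas this hypothetical CUDA example merely mimics it by giving even- and odd-numbered warps different load/compute orderings so their memory request bursts are staggered in time. All names in it (`twin_schedule_saxpy`, the warp-parity test) are illustrative assumptions.

```cuda
// Hypothetical illustration of the twin-schedule idea (not the paper's
// binary-level TKMT transformation): even and odd warps execute the same
// computation with different instruction orderings, so their memory
// requests are not all issued at the same moment.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void twin_schedule_saxpy(const float *x, const float *y,
                                    float *out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int warp_id = i / warpSize;
    float xi, yi;

    if ((warp_id & 1) == 0) {
        // Schedule A: issue both loads back to back (memory burst up front).
        xi = x[i];
        yi = y[i];
        out[i] = a * xi + yi;
    } else {
        // Schedule B: place independent ALU work between the two loads, so
        // the second memory request is issued later and the burst is spread.
        xi = x[i];
        xi = a * xi;          // independent arithmetic between the loads
        yi = y[i];
        out[i] = xi + yi;
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *y, *out;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    twin_schedule_saxpy<<<(n + 255) / 256, 256>>>(x, y, out, 3.0f, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);  // expect 5.0 for both schedules

    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}
```

Both branches compute the same result; only the relative timing of the two loads differs per warp, which is the effect TKMT achieves statically without any source-level branching.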
