Abstract

Graphics Processing Units (GPUs) have evolved into high-performance processors for general-purpose data-parallel applications. Most GPU execution follows a Single Instruction Multiple Data (SIMD) model. Typically, little attention is paid to whether the input data to the SIMD lanes are the same or different. We have observed that a significant number of SIMD instructions exhibit scalar characteristics, i.e., they operate on the same data across all of their active lanes. Treating them as ordinary SIMD instructions results in redundant and inefficient GPU execution. To better serve both scalar and vector operations, we propose a novel scalar-vector GPU architecture. Our specialized scalar pipeline handles scalar instructions efficiently with only a single copy of the data, freeing the SIMD pipeline for normal vector execution. We propose a novel synchronization scheme to resolve data dependencies between scalar and vector instructions. With our optimized warp scheduling and instruction dispatching schemes, the scalar-vector GPU architecture achieves an average performance improvement of 19% on the Parboil and Rodinia benchmark suites. We also examine the effect of varying warp sizes on scalar-vector execution and explore subwarp execution for power efficiency. Our results show that, on average, power is reduced by 18%.
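The key observation, that many SIMD instructions carry the same operand value in every active lane, can be illustrated in software on today's GPUs. The CUDA sketch below is our own illustration of that detection idea, not the paper's hardware mechanism; the kernel and variable names are hypothetical. It uses warp shuffle and vote intrinsics to count warps whose input operand is uniform across the active lanes:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Count warps whose input operand is identical across all active lanes,
// i.e., operands that a scalar pipeline could process with a single copy.
__global__ void count_scalar_ops(const int *in, int *scalar_count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;                        // inactive lanes drop out here
    unsigned mask = __activemask();            // lanes still executing
    int lane   = threadIdx.x % 32;             // lane id within the warp
    int leader = __ffs(mask) - 1;              // lowest active lane
    int v = in[i];
    // Broadcast the leader's value and test whether every active lane matches.
    bool uniform = __all_sync(mask, v == __shfl_sync(mask, v, leader));
    if (uniform && lane == leader)
        atomicAdd(scalar_count, 1);            // one count per uniform warp
}

int main() {
    const int n = 1024;
    int h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = 7;   // all-uniform input
    int *d_in, *d_count, h_count = 0;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_count, &h_count, sizeof(int), cudaMemcpyHostToDevice);
    count_scalar_ops<<<(n + 255) / 256, 256>>>(d_in, d_count, n);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    printf("warps with uniform operands: %d of %d\n", h_count, n / 32);
    cudaFree(d_in); cudaFree(d_count);
    return 0;
}
```

In the architecture the paper proposes, this check would happen in hardware at dispatch time; the software version above only conveys what "scalar characteristics" means for a warp's operands.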
