Abstract

Summary form only given. Graphics processing units (GPUs) are increasingly critical to general-purpose parallel computing performance. A GPU comprises many streaming multiprocessors (SMs), allowing it to execute tens of thousands of threads in parallel. However, because of the SIMD (single-instruction, multiple-data) execution style, resource utilization, and thus overall performance, can degrade significantly when threads take diverging control paths. Tuning GPU application performance is also a complex and labor-intensive task: programmers employ a variety of optimization techniques to explore trade-offs between thread-level parallelism and single-thread performance. Newer GPU architectures also allow concurrent kernel execution, which introduces interesting kernel scheduling problems. In the first part of the talk, we introduce our recent studies on control-flow optimization, joint optimization of register allocation and thread structure, and concurrent kernel scheduling for GPU performance improvement.

The energy efficiency of GPUs for general-purpose computing is increasingly important as well. The integration of GPUs onto SoCs for mobile devices over the last five years has further sharpened the need to reduce the energy footprint of GPUs. In the second part of the talk, we propose a novel GPU architecture that uses reconfiguration to exploit ILP, together with DVFS (dynamic voltage and frequency scaling), to reduce power consumption without sacrificing computational throughput. We expect applications with large amounts of ILP to see dramatic improvements in energy and power compared with nominal CUDA-based architectures. Beyond this, we foresee interesting challenges in thread scheduling and in the reorganization of CUDA warp structures and schedules.
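To illustrate why divergent control paths hurt SIMD utilization, the following is a minimal back-of-the-envelope model (an illustration, not a method from the talk): when the 32 threads of a warp split across two branch paths, the hardware executes each path serially with the inactive lanes masked off, so lane utilization drops.

```python
# Toy model of SIMD control-flow divergence (illustrative assumption:
# both branch paths have equal cost).
WARP_SIZE = 32

def divergent_warp_utilization(threads_taking_branch: int) -> float:
    """Fraction of useful lane-cycles when a warp splits across two paths.

    With k threads on one path and 32 - k on the other, the warp spends
    two serialized path-lengths of time, but each lane is active on only
    one of them.
    """
    k = threads_taking_branch
    if k == 0 or k == WARP_SIZE:
        return 1.0  # no divergence: every lane is active the whole time
    active_lane_cycles = WARP_SIZE       # k + (32 - k) useful lane-cycles
    total_lane_cycles = 2 * WARP_SIZE    # two serialized passes over the warp
    return active_lane_cycles / total_lane_cycles

# A warp that splits at all halves its utilization under this model:
print(divergent_warp_utilization(16))  # 0.5
print(divergent_warp_utilization(0))   # 1.0
```

Under this equal-cost assumption even a single diverging thread costs a full extra pass, which is why control-flow optimizations that keep warps convergent can matter so much.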
We also note that dynamically reconfiguring the cores within a SIMD unit (an SM in CUDA) changes the number of threads that can execute concurrently, and therefore the number of effective warps in flight, which may affect the ability to overlap computation with memory latency.
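The interaction between warps in flight and latency hiding can be sketched with a simple model (a hypothetical estimate, not a result from the talk): a classic rule of thumb is that hiding a memory stall of L cycles behind compute phases of C cycles per warp requires roughly ceil(L / C) + 1 resident warps. Reconfiguration that shrinks the concurrent-thread budget shrinks the number of resident warps and may fall below that threshold.

```python
import math

# Hypothetical latency-hiding estimate; the cycle counts below are
# illustrative, not measurements.

def warps_needed_to_hide(mem_latency_cycles: int, compute_cycles_per_warp: int) -> int:
    """Warps required so that while one warp waits on memory, the
    remaining warps keep the SIMD unit busy."""
    return math.ceil(mem_latency_cycles / compute_cycles_per_warp) + 1

def resident_warps(threads_per_sm: int, warp_size: int = 32) -> int:
    """Warps in flight on one SM for a given concurrent-thread budget."""
    return threads_per_sm // warp_size

# A full configuration vs. one narrowed by reconfiguration:
full_cfg = resident_warps(2048)    # 64 warps in flight
narrow_cfg = resident_warps(512)   # 16 warps in flight

needed = warps_needed_to_hide(mem_latency_cycles=400, compute_cycles_per_warp=20)
print(needed, full_cfg >= needed, narrow_cfg >= needed)  # 21 True False
```

In this sketch the narrowed configuration can no longer fully hide a 400-cycle memory latency, which is exactly the kind of throughput risk the proposed architecture's scheduling would need to manage.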
