Abstract

Modern GPU Streaming Multiprocessors (SMs) have several warp schedulers, execution units, and register file banks. To reduce area and energy-consumption, recent generations divide SMs into sub-cores. Each sub-core contains a distinct warp scheduler, register file, and execution units, sharing L1 memory and scratchpad resources with sub-cores in the same SM. Although partitioning the SM into sub-cores decreases the area and energy demands of larger SMs, it comes at a performance cost. Warps assigned to the SM have access to a fraction of the SM’s resources, resulting in contention and imbalance issues. In this paper, we examine the effect SM sub-division has on performance and propose novel mechanisms to mitigate the negative impacts. We identify four orthogonal effects caused by sub-dividing SMs and demonstrate that two of these effects have a significant impact on performance in practice. Based on these findings, we propose register-bank-aware warp scheduling to avoid bank conflicts that arise when instruction operands are placed in the limited number of register file banks available to each sub-core, and randomly hashed sub-core assignment to mitigate imbalance issues. Our intelligent scheduling mechanisms result in an average 11.2% speedup across a diverse set of applications capturing 81% of the performance lost to SM sub-division.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call