Countering Load-to-Use Stalls in the NVIDIA Turing GPU

Ram Rangan, Naman Turakhia, Alexandre Joly

doi:10.1109/mm.2020.3012514

Abstract

Among its various improvements over prior NVIDIA GPUs, the NVIDIA Turing GPU boasts four key performance enhancements that effectively counter memory load-to-use stalls. First, reduced latency on L1 hits for global memory loads lowers average memory lookup latency. Next, the ability to dynamically configure the L1 data RAM between cacheable memory and scratchpad (shared) memory enables driver software to opportunistically maximize L1 data cache size for programs with low shared memory requirements, increasing L1 hits and reducing load-to-use stalls. Finally, the twin enhancements of doubling vector register file capacity and adding a dedicated scalar (uniform) register file along with a uniform datapath ease vector register pressure and enable higher warp-level parallelism, leading to better latency tolerance. We find that these enhancements combined deliver an average speedup of 11% on modern gaming applications.

Full Text