Abstract
High-throughput architectures rely on high thread-level parallelism (TLP) to hide execution latencies. In state-of-the-art graphics processing units (GPUs), threads are organized in a grid of thread blocks (TBs), and each TB contains tens to hundreds of threads. With a TB-level resource management scheme, all the resources required by a TB are allocated when it is dispatched to a streaming multiprocessor (SM) and released only when the entire TB finishes. In this paper, we highlight that such TB-level resource management can severely limit the TLP that can be achieved in the hardware. First, different warps in a TB may finish at different times, which we refer to as 'warp-level divergence'. Under TB-level resource management, the resources allocated to early-finishing warps are essentially wasted, as they must wait for the longest-running warp in the same TB to finish. Second, TB-level management can lead to resource fragmentation. For example, the maximum number of threads that can run on an SM in an NVIDIA GTX 480 GPU is 1536. For an application with TBs of 1024 threads, only one TB can run on the SM even though it has sufficient resources for a few hundred more threads. To overcome these inefficiencies, we propose to allocate and release resources at the warp level. Warps are dispatched to an SM as long as it has sufficient resources for a warp rather than a whole TB, and whenever a warp completes, its resources are released and can accommodate a new warp. This way, we effectively increase the number of active warps without actually increasing the size of critical resources. We present lightweight architectural support for the proposed warp-level resource management. The experimental results show that our approach achieves performance gains of up to 76.0% (16.0% on average) and energy savings of up to 21.7% (6.7% on average), with minor hardware overhead.
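To make the fragmentation example concrete, the following is a minimal sketch (not code or a simulator from the paper) that contrasts how many threads a single SM can keep active under TB-level versus warp-level allocation, using the GTX 480 thread limit and the 1024-thread TB size quoted above; the function and constant names are illustrative assumptions.

```python
# Illustrative sketch of the resource-fragmentation example in the abstract.
# Assumed numbers: GTX 480 SM thread limit (1536) and a 1024-thread TB.

WARP_SIZE = 32
MAX_THREADS_PER_SM = 1536   # per-SM thread limit cited in the abstract
THREADS_PER_TB = 1024       # example TB size from the abstract

def active_threads_tb_level() -> int:
    """TB-level allocation: an entire TB's resources must fit at once."""
    whole_tbs = MAX_THREADS_PER_SM // THREADS_PER_TB   # only 1 TB fits
    return whole_tbs * THREADS_PER_TB                  # 1024 active threads

def active_threads_warp_level() -> int:
    """Warp-level allocation: warps are dispatched as long as one warp fits."""
    warps = MAX_THREADS_PER_SM // WARP_SIZE            # 48 warps fit
    return warps * WARP_SIZE                           # 1536 active threads

if __name__ == "__main__":
    print("TB-level  :", active_threads_tb_level(), "active threads")   # 1024
    print("Warp-level:", active_threads_warp_level(), "active threads") # 1536
```

Under these assumed numbers, warp-level dispatch raises the occupancy of the SM from 1024 to 1536 active threads without enlarging any per-SM resource, which is the effect the abstract describes; the same accounting applies to early-finishing warps, whose released resources can admit new warps before the rest of their TB completes.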