Abstract

Graphics processing units (GPUs) employ single-instruction, multiple-data (SIMD) hardware to run threads in parallel while allowing each thread to follow an arbitrary control flow. Threads running concurrently within a warp may jump to different paths after a conditional branch. Such divergent control flow leaves some lanes idle and hence reduces the SIMD utilization of the GPU. To alleviate this waste of SIMD lanes, threads from multiple warps can be collected together and compacted into the idle lanes to improve SIMD lane utilization. However, this mechanism introduces extra barrier synchronizations, since warps must stall and wait for other warps before a compaction, so that in some cases no warps can be scheduled at all. In this paper, we propose an approach to reduce the overhead of the barrier synchronizations induced by compactions. In our approach, a compaction is bypassed by warps whose threads all jump to the same path after a branch. Moreover, warps waiting for a compaction can also bypass it when no warps are ready to issue. In addition, a compaction is canceled if it cannot reduce the number of idle lanes. The experimental results demonstrate that our approach provides an average improvement of 21% over the baseline GPU for applications with many divergent branches, while recovering the performance loss induced by compactions by 13% on average for applications with many non-divergent control flows.
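The bypass and cancellation rules described above can be illustrated with a small model. The sketch below is not the paper's implementation; it is a simplified Python simulation (tracking only the threads on one branch path, with hypothetical helper names) of how warps whose lanes all agree bypass a compaction, and how a compaction is canceled when it cannot reduce the number of occupied warps.

```python
# Illustrative sketch, NOT the paper's hardware mechanism: a simplified
# model of thread compaction across warps after a divergent branch.
WARP_SIZE = 32

def is_divergent(taken_mask):
    """A warp diverges if its threads split across both branch paths."""
    active = sum(taken_mask)
    return 0 < active < len(taken_mask)

def warps_after_compaction(warps):
    """Return how many warps remain scheduled after attempting compaction.

    `warps` is a list of per-thread branch masks (True = thread takes the
    branch; for simplicity only this path is compacted here).
    Non-divergent warps bypass compaction; active threads of divergent
    warps are packed into as few warps as possible. The compaction is
    canceled if it would not reduce the number of occupied warps.
    """
    bypassed = 0
    active_threads = 0
    for mask in warps:
        if not is_divergent(mask):
            bypassed += 1              # bypass: all lanes take one path
        else:
            active_threads += sum(mask)
    n_divergent = len(warps) - bypassed
    n_compacted = -(-active_threads // WARP_SIZE)  # ceiling division
    if n_compacted >= n_divergent:
        return len(warps)              # cancel: no idle lanes reclaimed
    return bypassed + n_compacted

# Two divergent warps, each with 16 active lanes on this path,
# compact into a single full warp.
print(warps_after_compaction([[True] * 16 + [False] * 16,
                              [False] * 16 + [True] * 16]))  # 1
```

Under this model, a warp with a unanimous branch outcome never pays the compaction barrier, which mirrors the first bypass condition in the abstract.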
