Abstract

Improving latency hiding is important for GPU performance. Existing works, which mainly target either improving thread-level parallelism or optimizing the memory hierarchy, are effective at improving GPUs' latency hiding ability, but warps are still blocked after executing long-latency operations, reducing the number of schedulable warps. This article revisits the recently proposed non-blocking execution model for GPUs to improve their latency hiding ability. With non-blocking execution, instructions from warps blocked by long-latency operations can be pre-executed to make full use of GPU resources. However, we find that the state-of-the-art non-blocking GPU architecture gains limited performance improvement. Through in-depth analysis, we observe that the poor performance is largely due to inefficient pre-execution state management, duplicate instruction extraction, frequent early eviction, and severe resource congestion. To make non-blocking execution actually useful for GPUs and to minimize hardware overheads, we carefully redesign the non-blocking architecture based on our analysis and propose MIPSGPU. Our evaluations show that MIPSGPU, relative to the state-of-the-art non-blocking GPU architecture, improves the performance of memory-intensive applications by 19.05 percent and reduces memory-to-SM traffic by 14 percent.
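The core idea of non-blocking execution can be illustrated with a toy cycle-level sketch (not the paper's simulator or MIPSGPU itself): under a conventional blocking scheduler, a warp that issues a long-latency load cannot issue anything until the load returns, while a non-blocking scheduler lets that warp pre-execute instructions that do not depend on the pending load. The load latency, warp count, and instruction mix below are illustrative assumptions.

```python
# Toy comparison of blocking vs. non-blocking warp scheduling.
# All parameters here are illustrative, not taken from the paper.

LOAD_LATENCY = 20  # assumed long-latency memory operation, in cycles

# Each warp runs the same short trace:
#   'L' = long-latency load, 'I' = ALU op independent of the load,
#   'D' = ALU op that depends on the load's result.
WARPS = [['L', 'I', 'I', 'D'] for _ in range(4)]

def simulate(non_blocking):
    """Single-issue scheduler: at most one instruction issues per cycle."""
    pc = [0] * len(WARPS)            # next instruction index per warp
    load_done = [None] * len(WARPS)  # cycle when the outstanding load completes
    cycle = 0
    while any(pc[w] < len(WARPS[w]) for w in range(len(WARPS))):
        for w in range(len(WARPS)):
            if pc[w] >= len(WARPS[w]):
                continue  # warp finished
            op = WARPS[w][pc[w]]
            pending = load_done[w] is not None and load_done[w] > cycle
            if pending:
                # Blocking: the warp stalls on any outstanding load.
                # Non-blocking: only load-dependent ops ('D') must wait;
                # independent ops ('I') can be pre-executed.
                if not non_blocking or op == 'D':
                    continue
            if op == 'L':
                load_done[w] = cycle + LOAD_LATENCY
            pc[w] += 1
            break  # one issue slot per cycle
        cycle += 1
    return cycle

blocking_cycles = simulate(non_blocking=False)
nonblocking_cycles = simulate(non_blocking=True)
print(blocking_cycles, nonblocking_cycles)
```

Pre-execution lets blocked warps keep the issue slot busy during the memory stall, so the non-blocking run finishes in fewer cycles; the gap widens as more independent instructions follow each load. The sketch deliberately omits the pre-execution state management and resource congestion issues that the article identifies as the real bottlenecks.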
