Abstract

Improving latency hiding is important for GPU performance. Existing works, which mainly target either improving thread-level parallelism or optimizing the memory hierarchy, are effective at improving GPUs' latency hiding ability, but warps are still blocked after executing long-latency operations, reducing the number of schedulable warps. This article revisits the recently proposed non-blocking execution model for GPUs to improve their latency hiding ability. With non-blocking execution, instructions from warps blocked by long-latency operations can be pre-executed to make full use of GPU resources. However, we find that the state-of-the-art non-blocking GPU architecture gains limited performance improvement. Through in-depth analysis, we observe that the poor performance is largely due to inefficient pre-execution state management, duplicate instruction extraction, frequent early eviction, and severe resource congestion. To make non-blocking execution actually useful for GPUs and to minimize hardware overheads, we carefully redesign the non-blocking architecture based on our analysis and propose MIPSGPU. Our evaluations show that MIPSGPU, relative to the state-of-the-art non-blocking GPU architecture, improves the performance of memory-intensive applications by 19.05 percent and reduces memory-to-SM traffic by 14 percent.
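The core idea of non-blocking execution can be illustrated with a toy cycle-level sketch (not the paper's simulator or MIPSGPU itself): under a conventional blocking scheduler, a warp that issues a long-latency load cannot issue anything until the load returns, while a non-blocking scheduler lets that warp pre-execute instructions that do not depend on the pending load. The load latency, warp count, and instruction mix below are illustrative assumptions.

```python
# Toy comparison of blocking vs. non-blocking warp scheduling.
# All parameters here are illustrative, not taken from the paper.

LOAD_LATENCY = 20  # assumed long-latency memory operation, in cycles

# Each warp runs the same short trace:
#   'L' = long-latency load, 'I' = ALU op independent of the load,
#   'D' = ALU op that depends on the load's result.
WARPS = [['L', 'I', 'I', 'D'] for _ in range(4)]

def simulate(non_blocking):
    """Single-issue scheduler: at most one instruction issues per cycle."""
    pc = [0] * len(WARPS)            # next instruction index per warp
    load_done = [None] * len(WARPS)  # cycle when the outstanding load completes
    cycle = 0
    while any(pc[w] < len(WARPS[w]) for w in range(len(WARPS))):
        for w in range(len(WARPS)):
            if pc[w] >= len(WARPS[w]):
                continue  # warp finished
            op = WARPS[w][pc[w]]
            pending = load_done[w] is not None and load_done[w] > cycle
            if pending:
                # Blocking: the warp stalls on any outstanding load.
                # Non-blocking: only load-dependent ops ('D') must wait;
                # independent ops ('I') can be pre-executed.
                if not non_blocking or op == 'D':
                    continue
            if op == 'L':
                load_done[w] = cycle + LOAD_LATENCY
            pc[w] += 1
            break  # one issue slot per cycle
        cycle += 1
    return cycle

blocking_cycles = simulate(non_blocking=False)
nonblocking_cycles = simulate(non_blocking=True)
print(blocking_cycles, nonblocking_cycles)
```

Pre-execution lets blocked warps keep the issue slot busy during the memory stall, so the non-blocking run finishes in fewer cycles; the gap widens as more independent instructions follow each load. The sketch deliberately omits the pre-execution state management and resource congestion issues that the article identifies as the real bottlenecks.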
