Exploring AMD GPU Scheduling Details by Experimenting With “Worst Practices”

Nathan Otterness, James H. Anderson

doi:10.1145/3453417.3453432

Abstract

Graphics processing units (GPUs) have been the target of a significant body of recent real-time research, but research is often hampered by the “black box” nature of GPU hardware and software. Now that one GPU manufacturer, AMD, has embraced an open-source software stack, one may expect an increased amount of real-time research to use AMD GPUs. Reality, however, is more complicated. Without understanding where internal details may differ, researchers have no basis for assuming that observations made using NVIDIA GPUs will continue to hold for AMD GPUs. Additionally, the openness of AMD’s software does not mean that their scheduling behavior is obvious, especially due to sparse, scattered documentation. In this paper, we gather the disparate pieces of documentation into a single coherent source that provides an end-to-end description of how compute work is scheduled on AMD GPUs. In doing so, we start with a concrete demonstration of how incorrect management triggers extreme worst-case behavior in shared AMD GPUs. Subsequently, we explain the internal scheduling rules for AMD GPUs, how they led to the “worst practices,” and how to correctly manage some of the most performance-critical factors in AMD GPU sharing.

Full Text