Abstract

Driven by their impressive parallel processing capabilities, Graphics Processing Units (GPUs) have become the accelerator of choice for high-performance computing. Many data-parallel applications have enjoyed significant speedups after being re-engineered to leverage the thousands of cores on the GPU. For instance, training a complex deep neural network model on a GPU can be done within hours, versus the weeks it might take on more traditional CPUs. While most deep neural networks are hungry for ever more computing resources, a number of application kernels use only a fraction of the available resources. To better utilize the massive resources on the GPU, device vendors have started to support Concurrent Kernel Execution (CKE). The Hyper-Q technology from NVIDIA allows up to 32 data-independent kernels to run concurrently by leveraging parallel hardware work queues. These hardware work queues can execute concurrent kernels from either a single GPU context or multiple GPU contexts. With support for concurrent kernel execution, multiple applications can be co-located and co-scheduled on the same GPU, significantly improving resource utilization.

The application throughput provided by CKE is subject to a number of factors, including the kernel configuration attributes, the dynamic behavior of each kernel (e.g., compute-intensive vs. memory-intensive), the kernel launch order, and inter-kernel dependencies. Launching more concurrent kernels does not always achieve better performance, so it is challenging to predict the potential performance benefits of using CKE. Typically, a developer has to compile and run a program many times to obtain the best performance. In addition, when multiple GPU applications are co-scheduled on the device, contention for shared resources, such as memory bandwidth and computational pipelines, results in interference that can often degrade CKE performance.

In this thesis, we seek to optimize the execution efficiency of GPU workloads at both a kernel granularity and an application granularity. We focus on providing a performance tuning mechanism for concurrent kernel execution and develop an efficient GPU workload scheduler to achieve improved quality-of-service in a cloud environment. We have developed an empirical model named Moka to estimate the performance benefits of concurrent kernel execution. The model analyzes a non-CKE application comprising multiple kernels using profiling information, and delivers an estimate of the performance ceiling by taking into account data transfers and GPU kernel execution behavior. Moka also provides guidance for finding the best-performing kernel-stream mapping, quickly identifying the best CKE configuration and resulting in improved performance and the highest utilization of the GPU. In addition, we developed a machine-learning-based, interference-aware scheduler named Magic to improve system throughput for multitasking on GPUs. The Magic framework performs short offline profiling to identify the most important interference metrics and predicts the interference sensitivity of GPU workloads using selected machine learning models. Our scheduler outperforms a state-of-the-art similarity-based scheduler on a single-GPU system and achieves higher system throughput than a least-loaded policy on a multi-GPU system.
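To make the kernel-stream mapping that CKE relies on concrete, the following minimal CUDA sketch (an illustration, not code from the thesis; the kernel names and sizes are hypothetical) launches two data-independent kernels into separate streams, so that Hyper-Q's hardware work queues are free to overlap their execution:

    #include <cuda_runtime.h>

    // Two trivial, data-independent kernels; any independent kernels would do.
    __global__ void scaleKernel(float *a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;
    }

    __global__ void addKernel(float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) b[i] += 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));

        // One stream per kernel: independent streams can map to separate
        // hardware work queues, making concurrent execution possible.
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        dim3 block(256), grid((n + 255) / 256);
        scaleKernel<<<grid, block, 0, s0>>>(a, n);
        addKernel<<<grid, block, 0, s1>>>(b, n);

        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);

        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
        cudaFree(a);
        cudaFree(b);
        return 0;
    }

Whether the two kernels actually overlap depends on resource availability at runtime; as noted above, launching more concurrent kernels does not always improve performance, which is exactly the tuning problem Moka addresses.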
