Abstract

The past decade has witnessed a remarkable increase in deep learning (DL) workloads that require GPU resources to accelerate the training process. However, existing coarse-grained scheduling mechanisms are agnostic to information other than the number of GPUs or the amount of GPU memory, which degrades the performance of DL tasks. Moreover, existing balance-aware DL task scheduling strategies commonly assume that a DL task consumes all of its resources from the moment it starts; this assumption fails to reduce resource contention and further limits execution efficiency. To address these problems, this article proposes a fine-grained and balance-aware scheduling model (FBSM) that takes the resource consumption characteristics of DL tasks into account. Based on FBSM, we design customized GPU sniffer (GPU-S) and balance-aware scheduler (BAS) modules and use them to construct a scheduling system called KubFBS. Experimental results demonstrate that KubFBS accelerates the execution of DL tasks while improving the load-balancing capability of the cluster.