RIFLING: A reinforcement learning‐based GPU scheduler for deep learning research and development platforms

Zhaoyun Chen

doi:10.1002/spe.3066

Abstract

GPU platforms have been widely adopted in both academia and industry to support deep learning (DL) research and development (R&D). Whereas large companies favor custom‐designed AI platforms, most small‐ and medium‐sized enterprises, institutes, and universities (EIUs) prefer to build or rent a cost‐effective GPU cluster, usually of limited scale, to process diverse DL R&D workloads. DL scheduling has therefore attracted growing attention as a means of improving system efficiency and task performance. However, prior prediction‐based schedulers are limited by their prediction accuracy and profiling overhead. Accordingly, in this article, we propose RIFLING, a reinforcement learning (RL)‐based online GPU scheduler that models the scheduling problem as an online decision‐making process. Scheduling decisions are made using Q‐learning, a typical RL method. By exploring and exploiting diverse scheduling strategies online for various DL workloads, RIFLING achieves high scheduling efficiency without expensive offline profiling or a sophisticated prediction model. We implement RIFLING as a plugin for TensorFlow and deploy it on a distributed GPU cluster. Experiments demonstrate that, without any manual intervention, RIFLING achieves up to a 47.8% reduction in makespan and a 19.6% improvement in average normalized processing rate compared with the best available baseline.
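To make the Q‐learning idea mentioned above concrete, the following is a minimal sketch of tabular Q‐learning with epsilon‐greedy exploration, written for illustration only. It is not RIFLING's implementation: the class name `QScheduler`, the use of an opaque hashable `state`, the choice of "which GPU to place the next task on" as the action, the reward signal, and the hyperparameter defaults are all assumptions made here, since the abstract does not specify them.

```python
import random
from collections import defaultdict


class QScheduler:
    """Illustrative tabular Q-learning agent (hypothetical, not RIFLING itself).

    Assumptions: a state is any hashable summary of cluster/task status,
    and an action is the index of the GPU chosen for the next task.
    """

    def __init__(self, n_gpus, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.n_gpus = n_gpus
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # probability of exploring a random action
        self.rng = random.Random(seed)
        # Q-table: one list of action values per state, initialised to zero.
        self.q = defaultdict(lambda: [0.0] * n_gpus)

    def choose(self, state):
        """Epsilon-greedy selection: explore at random, otherwise exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_gpus)
        values = self.q[state]
        return max(range(self.n_gpus), key=values.__getitem__)

    def update(self, state, action, reward, next_state):
        """Standard Q-learning update toward reward + gamma * max Q(next)."""
        best_next = max(self.q[next_state])
        td_target = reward + self.gamma * best_next
        self.q[state][action] += self.alpha * (td_target - self.q[state][action])
```

In an online setting, such an agent would call `choose` when a task arrives and `update` once the outcome of the placement is observed (for example, a reward derived from the task's measured processing rate, which is again an assumption), so no offline profiling pass is needed.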

Full Text