Abstract

Deep learning (DL) jobs normally run on GPU clusters. Some DL jobs need to be scheduled preemptively to avoid long waiting times. However, preempting a DL job is time-consuming: suspension must complete the training of the current epoch, and resumption must reload the model and the training data. Existing schedulers largely ignore the overhead of preempting jobs; as a result, they may preempt jobs with a large time loss, increasing both the waiting time and the makespan. In this paper, we present PickyMan, a preemptive scheduler that minimizes the overhead of preempting jobs so as to reduce the average waiting time and the makespan. PickyMan makes three contributions. (1) Predicting execution time from network traffic and a database. It predicts the execution time of a DL job by profiling the network traffic from storage nodes to computation nodes and consulting a database, without allocating extra resources from the cluster. Using the profiled information of only four jobs, it can predict the execution times of other jobs with the same model, and most prediction errors are below 10%. (2) Modeling the overhead of preemption. It builds a model that predicts the time loss of job suspensions and resumptions with an average error below 5%. (3) Choosing jobs to preempt. We abstract the problem of choosing appropriate jobs for preemption as finding an ordered division of the set of running jobs, and we solve it quickly with a greedy algorithm. In experiments on a small-scale real cluster and in large-scale simulations, PickyMan reduces the average waiting time by 10%–92% and further reduces the makespan by up to 14% compared with existing methods.
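To make the greedy victim-selection idea concrete, here is a minimal sketch. It is not PickyMan's actual algorithm (the abstract does not give it); the job tuples, the overhead values, and the overhead-per-GPU ordering heuristic are all invented for illustration, under the assumption that each running job has a known GPU count and a predicted preemption overhead.

```python
from typing import List, Tuple

def pick_victims(running: List[Tuple[str, int, float]],
                 gpus_needed: int) -> Tuple[List[str], int]:
    """Greedily pick running jobs to preempt until enough GPUs are freed.

    running: (job_id, gpus_held, predicted_preemption_overhead_sec) tuples.
    Jobs with the lowest overhead per freed GPU are preempted first,
    approximating a minimum-time-loss choice.
    """
    # Order jobs so that cheap-to-preempt jobs (per GPU freed) come first.
    ordered = sorted(running, key=lambda j: j[2] / j[1])
    victims: List[str] = []
    freed = 0
    for job_id, gpus, _overhead in ordered:
        if freed >= gpus_needed:
            break
        victims.append(job_id)
        freed += gpus
    return victims, freed

# Hypothetical workload: (job_id, GPUs held, predicted overhead in seconds).
jobs = [("a", 4, 120.0), ("b", 2, 10.0), ("c", 8, 400.0)]
print(pick_victims(jobs, 6))
```

A real scheduler would weigh the freed capacity against the arriving job's predicted execution time rather than just hitting a GPU quota, but the sketch shows why modeling per-job preemption overhead matters: without it, the scheduler cannot rank victims at all.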
