Abstract

Deep learning (DL) jobs normally run on GPU clusters. Some DL jobs need to be scheduled preemptively to avoid long waiting times. However, preempting a DL job is time-consuming: suspension must complete the training of the current epoch, and resumption must reload the model and the training data. Existing schedulers largely ignore the overhead of preempting jobs; as a result, they may preempt jobs with a large time loss, increasing both the waiting time and the makespan. In this paper, we present PickyMan, a preemptive scheduler that minimizes the overhead of preempting jobs so as to reduce the average waiting time and the makespan. PickyMan makes three contributions. (1) Predicting execution time from network traffic and a database. It predicts the execution time of a DL job by profiling the network traffic from storage nodes to computation nodes and consulting a database, without allocating extra resources from the cluster. Using the profiled information of only four jobs, it can predict the execution times of other jobs with the same model, and most prediction errors are below 10%. (2) Modeling the overhead of preemption. It builds a model that predicts the time loss of job suspensions and resumptions with an average error below 5%. (3) Choosing jobs to preempt. We abstract the problem of choosing appropriate jobs for preemption as finding an ordered division of the set of running jobs, and we solve it quickly with a greedy algorithm. In experiments on a small-scale real cluster and in large-scale simulations, PickyMan reduces the average waiting time by 10%–92% and further reduces the makespan by up to 14% compared with existing methods.
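To make the greedy victim-selection idea concrete, here is a minimal sketch. It is not PickyMan's actual algorithm (the abstract does not give it); the job tuples, the overhead values, and the overhead-per-GPU ordering heuristic are all invented for illustration, under the assumption that each running job has a known GPU count and a predicted preemption overhead.

```python
from typing import List, Tuple

def pick_victims(running: List[Tuple[str, int, float]],
                 gpus_needed: int) -> Tuple[List[str], int]:
    """Greedily pick running jobs to preempt until enough GPUs are freed.

    running: (job_id, gpus_held, predicted_preemption_overhead_sec) tuples.
    Jobs with the lowest overhead per freed GPU are preempted first,
    approximating a minimum-time-loss choice.
    """
    # Order jobs so that cheap-to-preempt jobs (per GPU freed) come first.
    ordered = sorted(running, key=lambda j: j[2] / j[1])
    victims: List[str] = []
    freed = 0
    for job_id, gpus, _overhead in ordered:
        if freed >= gpus_needed:
            break
        victims.append(job_id)
        freed += gpus
    return victims, freed

# Hypothetical workload: (job_id, GPUs held, predicted overhead in seconds).
jobs = [("a", 4, 120.0), ("b", 2, 10.0), ("c", 8, 400.0)]
print(pick_victims(jobs, 6))
```

A real scheduler would weigh the freed capacity against the arriving job's predicted execution time rather than just hitting a GPU quota, but the sketch shows why modeling per-job preemption overhead matters: without it, the scheduler cannot rank victims at all.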
