Abstract

This paper proposes Hermes, a container-based preemptive GPU scheduling framework for accelerating hyper-parameter optimization in deep learning (DL) clusters. Hermes accelerates hyper-parameter optimization by time-sharing GPUs between DL jobs and prioritizing jobs with more promising hyper-parameter combinations. Hermes’s scheduling policy is grounded in the observation that good hyper-parameter combinations converge quickly in the early phases of training. By giving higher priority to fast-converging containers, Hermes’s GPU preemption mechanism accelerates their training, enabling users to find optimal hyper-parameters faster without losing the progress of any container. We have implemented Hermes on top of Kubernetes and compared its performance against existing scheduling frameworks. Experiments show that Hermes reduces hyper-parameter optimization time by up to 4.04 times compared with previously proposed scheduling policies such as FIFO, round-robin (RR), and SLAQ, with minimal time-sharing overhead.
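
As a rough illustration of the idea behind this policy, the sketch below (not taken from the paper) shows one way a scheduler could score hyper-parameter trials by their recent convergence speed. The ConvergenceTracker class, the window size, and the loss-drop heuristic are illustrative assumptions, not Hermes’s actual scoring rule.

    from collections import deque


    class ConvergenceTracker:
        """Tracks recent training losses and scores how fast a trial is converging."""

        def __init__(self, window: int = 10):
            self.losses = deque(maxlen=window)   # most recent loss values

        def report(self, loss: float) -> None:
            self.losses.append(loss)

        def speed(self) -> float:
            # Average per-step loss drop over the window; a higher value means
            # faster convergence, i.e. a more promising hyper-parameter trial.
            if len(self.losses) < 2:
                return 0.0
            return (self.losses[0] - self.losses[-1]) / (len(self.losses) - 1)


    # A quickly converging trial scores higher than a stagnating one.
    fast, slow = ConvergenceTracker(), ConvergenceTracker()
    for step in range(10):
        fast.report(2.0 * 0.8 ** step)    # loss falls quickly: promising trial
        slow.report(2.0 - 0.01 * step)    # loss barely moves: stagnating trial
    assert fast.speed() > slow.speed()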

Highlights

  • Deep learning (DL) has recently seen immense success in various fields, such as computer vision and natural language processing

  • We evaluate the performance of Hermes using the convolutional neural network (CNN) benchmark [14] from TensorFlow

  • Performance results show that Hermes shortens the hyper-parameter optimization process by up to 4.04 times with minimal time-sharing overhead when compared against scheduling policies supported by other cluster managers, such as FIFO, round-robin (RR), and SLAQ


Summary

Introduction

Deep learning (DL) has recently seen immense success in various fields, such as computer vision and natural language processing. In kill-based preemption, users continuously monitor the convergence of DL training jobs and manually kill the containers that are not making progress. SLAQ [12] prioritizes low-performing DL jobs based on convergence feedback; this can harm the performance of hyper-parameter optimization by executing unpromising hyper-parameter combinations first. Convergence-aware scheduling policy: Hermes accelerates hyper-parameter optimization by prioritizing jobs based on their convergence speed. This sharply contrasts with Gandiva’s approach [7], since Hermes does not train all tasks equally but instead selects and accelerates the important ones. Performance results show that Hermes shortens the hyper-parameter optimization process by up to 4.04 times with minimal time-sharing overhead when compared against scheduling policies supported by other cluster managers, such as FIFO, round-robin (RR), and SLAQ.
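
The sketch below is a minimal, simplified picture of how such a convergence-aware, preemptive scheduler might assign GPU time slices; it is not Hermes’s implementation. The Job class, the schedule_quantum function, and the suspend/resume hooks are hypothetical names introduced here; in a real cluster those hooks would pause and unpause containers without discarding their training progress.

    import time
    from dataclasses import dataclass
    from typing import List


    @dataclass
    class Job:
        name: str
        convergence_speed: float   # e.g. an estimate like the one sketched earlier
        running: bool = False

        def resume(self) -> None:
            # Hypothetical hook: unpause the job's container so it regains the GPU.
            self.running = True

        def suspend(self) -> None:
            # Hypothetical hook: pause the container while keeping its progress.
            self.running = False


    def schedule_quantum(jobs: List[Job]) -> Job:
        """Give the next time slice to the fastest-converging job, preempting the rest."""
        chosen = max(jobs, key=lambda j: j.convergence_speed)
        for job in jobs:
            if job is not chosen and job.running:
                job.suspend()
        chosen.resume()
        return chosen


    if __name__ == "__main__":
        jobs = [Job("lr=0.1", 0.19), Job("lr=0.001", 0.01), Job("lr=0.5", 0.05)]
        for _ in range(3):                 # three scheduling quanta
            winner = schedule_quantum(jobs)
            print("running:", winner.name)
            time.sleep(0.01)               # stand-in for a training time slice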

Training of Deep Learning Models
Overview of Deep Learning Training
Hyper-Parameter Optimization
Grid Search
Random Search
Bayesian Optimization
Motivation
Overall Architecture of Hermes
Global Scheduler
Node Scheduler
Testbed
Workloads
Baselines
Hyper-Parameter Optimization Speed
Overhead Analysis
Related Work
Deep Learning Scheduling Frameworks
Hyper-Parameter Optimization Frameworks
Findings
Conclusions