Abstract

This paper proposes Hermes, a container-based preemptive GPU scheduling framework for accelerating hyper-parameter optimization in deep learning (DL) clusters. Hermes accelerates hyper-parameter optimization by time-sharing GPUs between DL jobs and prioritizing jobs with more promising hyper-parameter combinations. Hermes’s scheduling policy is grounded in the observation that good hyper-parameter combinations converge quickly in the early phases of training. By giving higher priority to fast-converging containers, Hermes’s GPU preemption mechanism accelerates their training, enabling users to find optimal hyper-parameters faster without losing the progress of any container. We have implemented Hermes on top of Kubernetes and compared its performance against existing scheduling frameworks. Experiments show that Hermes reduces hyper-parameter optimization time by up to 4.04 times compared with previously proposed scheduling policies such as FIFO, round-robin (RR), and SLAQ, with minimal time-sharing overhead.
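
As a rough illustration of the idea behind this policy, the sketch below (not taken from the paper) shows one way a scheduler could score hyper-parameter trials by their recent convergence speed. The ConvergenceTracker class, the window size, and the loss-drop heuristic are illustrative assumptions, not Hermes’s actual scoring rule.

    from collections import deque


    class ConvergenceTracker:
        """Tracks recent training losses and scores how fast a trial is converging."""

        def __init__(self, window: int = 10):
            self.losses = deque(maxlen=window)   # most recent loss values

        def report(self, loss: float) -> None:
            self.losses.append(loss)

        def speed(self) -> float:
            # Average per-step loss drop over the window; a higher value means
            # faster convergence, i.e. a more promising hyper-parameter trial.
            if len(self.losses) < 2:
                return 0.0
            return (self.losses[0] - self.losses[-1]) / (len(self.losses) - 1)


    # A quickly converging trial scores higher than a stagnating one.
    fast, slow = ConvergenceTracker(), ConvergenceTracker()
    for step in range(10):
        fast.report(2.0 * 0.8 ** step)    # loss falls quickly: promising trial
        slow.report(2.0 - 0.01 * step)    # loss barely moves: stagnating trial
    assert fast.speed() > slow.speed()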

Highlights

  • Deep learning (DL) has recently seen immense success in various fields, such as computer vision and natural language processing

  • We evaluate the performance of Hermes using the convolutional neural network (CNN) benchmark [14] from TensorFlow

  • Performance results show that Hermes shortens the hyper-parameter optimization process by up to 4.04 times with minimal time-sharing overhead when compared against scheduling policies supported by other cluster managers, such as FIFO, round-robin (RR), and SLAQ


Summary

Introduction

Deep learning (DL) has recently seen immense success in various fields, such as computer vision and natural language processing. In kill-based preemption, users continuously monitor the convergence of DL training jobs and manually kill the containers that are not making progress. SLAQ [12] prioritizes low-performing DL jobs based on convergence feedback; this can harm the performance of hyper-parameter optimization by executing unpromising hyper-parameter combinations first. Convergence-aware scheduling policy: Hermes accelerates hyper-parameter optimization by prioritizing jobs based on their convergence speed. This sharply contrasts with Gandiva’s approach [7], since Hermes does not train all tasks equally but instead selects and accelerates the important ones. Performance results show that Hermes shortens the hyper-parameter optimization process by up to 4.04 times with minimal time-sharing overhead when compared against scheduling policies supported by other cluster managers, such as FIFO, round-robin (RR), and SLAQ.
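
The sketch below is a minimal, simplified picture of how such a convergence-aware, preemptive scheduler might assign GPU time slices; it is not Hermes’s implementation. The Job class, the schedule_quantum function, and the suspend/resume hooks are hypothetical names introduced here; in a real cluster those hooks would pause and unpause containers without discarding their training progress.

    import time
    from dataclasses import dataclass
    from typing import List


    @dataclass
    class Job:
        name: str
        convergence_speed: float   # e.g. an estimate like the one sketched earlier
        running: bool = False

        def resume(self) -> None:
            # Hypothetical hook: unpause the job's container so it regains the GPU.
            self.running = True

        def suspend(self) -> None:
            # Hypothetical hook: pause the container while keeping its progress.
            self.running = False


    def schedule_quantum(jobs: List[Job]) -> Job:
        """Give the next time slice to the fastest-converging job, preempting the rest."""
        chosen = max(jobs, key=lambda j: j.convergence_speed)
        for job in jobs:
            if job is not chosen and job.running:
                job.suspend()
        chosen.resume()
        return chosen


    if __name__ == "__main__":
        jobs = [Job("lr=0.1", 0.19), Job("lr=0.001", 0.01), Job("lr=0.5", 0.05)]
        for _ in range(3):                 # three scheduling quanta
            winner = schedule_quantum(jobs)
            print("running:", winner.name)
            time.sleep(0.01)               # stand-in for a training time slice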

Training of Deep Learning Models
Overview of Deep Learning Training
Hyper-Parameter Optimization
Grid Search
Random Search
Bayesian Optimization
Motivation
Overall Architecture of Hermes
Global Scheduler
Node Scheduler
Testbed
Workloads
Baselines
Hyper-Parameter Optimization Speed
Overhead Analysis
Related Work
Deep Learning Scheduling Frameworks
Hyper-Parameter Optimization Frameworks
Findings
Conclusions