Abstract

The recent adoption of deep learning for diverse applications has required infrastructures to be scaled horizontally and configured in hybrid vertical setups. As a result, efficient resource management for distributed deep learning (DDL) frameworks is becoming increasingly important. However, existing techniques for scaling DDL applications rely on general-purpose resource managers originally designed for data-intensive applications. In contrast, DDL applications present unique resource-management challenges compared to traditional big data frameworks: a different master–slave communication paradigm, deeper ML models that are bound by computation and network rather than I/O, the use of heterogeneous resources (e.g., GPUs, TPUs), and variable memory requirements. In addition, most DDL frameworks require data scientists to manually configure task placement and resource assignment when executing DDL models. In this paper, we present Dike, an automatic resource management framework that transparently makes placement and resource-assignment decisions for DDL workers and parameter servers, based on the unique characteristics of the DDL model (the number and type of parameters and neural network layers), node heterogeneity (CPU/GPU ratios), and the input dataset. We implemented Dike as a resource manager for DDL jobs in TensorFlow on top of Apache Mesos. We show that Dike significantly outperformed both manual and static assignment of resource offers to TensorFlow tasks, and achieved at least 95% of the optimal throughput for different DDL models such as ResNet and Inception.

Highlights

  • Today, distributed deep learning (DDL) is widely used in different areas ranging from image classification to speech recognition [1,2]

  • The data scientist has to address the following four questions while deploying a DDL model: (1) How many DDL tasks need to be launched? (2) How many resources should be allocated to each task? (3) What is the role or functionality of each task? (4) Which physical node should be used to launch each task? Because these questions are specific to the requirements of the DDL model and have multiple possible answers, users must iteratively try out different deployment plans, which requires considerable manual tuning

  • We show that Dike achieves at least 95% of the optimal performance for distributed DDL workloads and automates most of the cluster resource management


Introduction

Today, distributed deep learning (DDL) is widely used in areas ranging from image classification to speech recognition [1,2]. Resource management in major DDL frameworks is still evolving and does not account for the unique characteristics of machine learning jobs. The data scientist has to address the following four questions while deploying a DDL model: (1) How many DDL tasks need to be launched? (2) How many resources should be allocated to each task? (3) What is the role or functionality of each task? (4) Which physical node should be used to launch each task? Task placement determines where each task runs, while resource assignment determines the total number of tasks and the resources allocated to each one. A Spark application may determine the total number of tasks and their resource bindings based on available memory and the size of the data input, while Hadoop tends to assign tasks based on data locality. In DDL, both parts are still mainly handled manually, owing to performance concerns and a lack of competitive tools
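As a rough illustration of the kind of decisions such a framework automates, the sketch below derives a worker/parameter-server placement plan from node GPU/CPU counts and the model's parameter size. The `Node` class, the `plan_placement` function, and the one-shard-per-256 MB heuristic are hypothetical simplifications for illustration, not Dike's actual algorithm.

```python
# Illustrative sketch only: a toy placement heuristic, not Dike's scheduler.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpus: int
    gpus: int

def plan_placement(nodes, param_bytes, ps_shard_bytes=256 << 20):
    """Place one worker on every GPU node and spread parameter-server
    shards across CPU-rich nodes, one shard per ~256 MB of parameters,
    so that a large model does not create a single network hotspot."""
    workers = [n.name for n in nodes if n.gpus > 0]
    # Ceiling division: at least one PS, more for larger models.
    num_ps = max(1, -(-param_bytes // ps_shard_bytes))
    cpu_nodes = sorted(nodes, key=lambda n: n.cpus, reverse=True)
    ps = [cpu_nodes[i % len(cpu_nodes)].name for i in range(num_ps)]
    return {"worker": workers, "ps": ps}
```

For example, a 600 MB model on three nodes (two with GPUs) yields workers on the GPU nodes and three parameter-server shards round-robined across the nodes in descending CPU order. A real scheduler would additionally weigh network topology, per-layer parameter types, and the CPU/GPU ratio signals the paper describes.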
