Abstract

The recent adoption of deep learning for diverse applications has required infrastructures to scale horizontally and to adopt hybrid configurations vertically. As a result, efficient resource management for distributed deep learning (DDL) frameworks is becoming increasingly important. However, existing techniques for scaling DDL applications rely on general-purpose resource managers originally designed for data-intensive applications. DDL applications present resource management challenges that traditional big data frameworks do not, such as a different master–slave communication paradigm, deeper ML models that are more compute- and network-bound than I/O-bound, the use of heterogeneous resources (e.g., GPUs, TPUs), and variable memory requirements. In addition, most DDL frameworks require data scientists to manually configure task placement and resource assignment in order to execute DDL models. In this paper, we present Dike, an automatic resource management framework that transparently makes scheduling decisions for the placement of, and resource assignment to, DDL workers and parameter servers, based on the characteristics of the DDL model (number and type of parameters and neural network layers), node heterogeneity (CPU/GPU ratios), and the input dataset. We implemented Dike as a resource manager for DDL jobs in TensorFlow on top of Apache Mesos. We show that Dike significantly outperformed both manual and static assignment of resource offers to TensorFlow tasks, and achieved at least 95% of the optimal throughput for different DDL models such as ResNet and Inception.

Highlights

  • Today, distributed deep learning (DDL) is widely used in different areas ranging from image classification to speech recognition [1,2]

  • The data scientist has to address the following four questions when deploying a DDL model: (1) How many DDL tasks need to be launched? (2) How much resource should be allocated to each task? (3) What is the role or functionality of each task? (4) Which physical node should be used for launching each task? Because these questions are specific to the requirements of the DDL model and have multiple possible answers, users need to iteratively try out different deployment plans, which requires considerable manual tuning

  • We show that Dike achieves at least 95% of the optimal performance for DDL workloads and automates most of the cluster resource management



Introduction

Today, distributed deep learning (DDL) is widely used in areas ranging from image classification to speech recognition [1,2]. Resource management in major DDL frameworks is still evolving and does not account for the unique characteristics of machine learning jobs. The data scientist has to address four questions when deploying a DDL model: (1) How many DDL tasks need to be launched? (2) How much resource should be allocated to each task? (3) What is the role or functionality of each task? (4) Which physical node should be used for launching each task? Resource assignment determines the total number of tasks and the resources allocated to each task. By comparison, a Spark application may determine the total number of tasks and the resource binding based on available memory and the size of the input data, and Hadoop tends to assign tasks based on data locality. In DDL, both resource assignment and task placement are still handled mainly by hand, because of performance concerns and a lack of competitive tools.
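To make the deployment questions concrete, the following minimal Python sketch encodes one task's configuration and counts how many candidate deployment plans a user would face on a small cluster. It is only an illustration of the size of the search space, not Dike's scheduling algorithm: the names `TaskSpec`, `task_options`, and `count_plans`, and the example node names and resource options, are all hypothetical, and the count ignores node capacity limits and equivalent (reordered) plans.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TaskSpec:
    """One DDL task in a deployment plan (hypothetical structure)."""
    role: str   # question 3: "worker" or "ps" (parameter server)
    node: str   # question 4: physical node that hosts the task
    cpus: int   # question 2: CPU cores assigned to the task
    gpus: int   # question 2: GPUs assigned to the task


def task_options(nodes, cpu_options, gpu_options):
    """Enumerate every configuration a single task could take."""
    return [
        TaskSpec(role, node, cpus, gpus)
        for role, node, cpus, gpus in product(
            ("worker", "ps"), nodes, cpu_options, gpu_options
        )
    ]


def count_plans(nodes, max_tasks, cpu_options, gpu_options):
    """Count ordered plans with 1..max_tasks tasks (question 1).

    This is an upper bound: capacity constraints and plans that differ
    only in task order are not filtered out.
    """
    per_task = len(task_options(nodes, cpu_options, gpu_options))
    return sum(per_task ** n for n in range(1, max_tasks + 1))


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]  # assumed example cluster
    print(count_plans(nodes, max_tasks=4, cpu_options=[2, 4], gpu_options=[0, 1]))
```

Even with three nodes, two CPU options, two GPU options, and at most four tasks, the unpruned space runs to hundreds of thousands of plans, which is why iterating over deployment plans by hand is costly.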

