Abstract

Distributed deep learning training today typically relies on static resource allocation. Under the parameter server architecture, training is carried out by a number of parameter server (ps) nodes and worker nodes, and this number stays constant throughout training. A training job cannot acquire more resources in the middle of its run, even when additional free resources are available in the cluster. As a result, the job runs slower than it could; moreover, if those free resources remain idle for a long time, the cluster's resources become underutilized. In this research, dynamic resource allocation is designed and implemented for TensorFlow-based training jobs running on top of a Kubernetes cluster. The implementation introduces a component called the Config Manager (CM), whose role is to track the cluster's resource state at any given time and to add ps and worker nodes to a training job once free resources exist. The experiments show that training with dynamic resource allocation via the Config Manager outperforms static resource allocation on the following metrics: resource utilization, epoch time, and total training time. Total training time can be reduced by more than 50%, while the cluster's resource utilization stays high.
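The abstract does not give the Config Manager's internals, but its core decision, turning a pool of free resources into additional ps and worker nodes, can be sketched as follows. All names, the CPU-based accounting, and the worker-to-ps ratio are illustrative assumptions, not the paper's actual design.

```python
def plan_scale_up(free_cpus, cpus_per_worker, cpus_per_ps, workers_per_ps=4):
    """Decide how many extra worker and ps nodes fit into the free CPUs.

    Hypothetical sketch of a Config Manager scaling decision: one ps node is
    provisioned per `workers_per_ps` new workers (an assumed ratio), and nodes
    are added greedily until the free CPU budget is exhausted.
    """
    new_workers, new_ps = 0, 0
    remaining = free_cpus
    while remaining >= cpus_per_worker:
        # A new ps node must come first whenever adding another worker
        # would exceed the assumed worker/ps ratio.
        need_ps = new_ps * workers_per_ps <= new_workers
        cost = cpus_per_worker + (cpus_per_ps if need_ps else 0)
        if remaining < cost:
            break  # not enough budget for the worker plus its required ps
        if need_ps:
            new_ps += 1
            remaining -= cpus_per_ps
        new_workers += 1
        remaining -= cpus_per_worker
    return new_workers, new_ps


# With 10 free CPUs, 2 CPUs per worker, and 1 CPU per ps, this plan
# adds 4 workers and 1 ps node (9 CPUs used, 1 left over).
print(plan_scale_up(10, cpus_per_worker=2, cpus_per_ps=1))
```

In a real deployment the free-resource figures would come from the Kubernetes API (e.g. node capacity minus pod requests), and the new nodes would be launched by updating the training job's cluster specification, steps the abstract summarizes but does not detail.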
