Abstract

Distributed deep learning training today typically relies on static resource allocation. Under the parameter server architecture, training is carried out by a number of parameter server (ps) nodes and worker nodes, and this number stays constant throughout training. A training job cannot acquire more resources in the middle of its run, even when additional free resources are available in the cluster. As a result, the job runs slower than it could; moreover, if those free resources remain idle for a long time, the cluster's resources become underutilized. In this research, dynamic resource allocation is designed and implemented for TensorFlow-based training jobs running on top of a Kubernetes cluster. The implementation introduces a component called the Config Manager (CM), whose role is to track the cluster's resource state at any given time and to add ps and worker nodes to a training job once free resources exist. The experiments show that training with dynamic resource allocation via the Config Manager outperforms static resource allocation on the following metrics: resource utilization, epoch time, and total training time. Total training time can be reduced by more than 50%, while the cluster's resource utilization stays high.
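The abstract does not give the Config Manager's internals, but its core decision, turning a pool of free resources into additional ps and worker nodes, can be sketched as follows. All names, the CPU-based accounting, and the worker-to-ps ratio are illustrative assumptions, not the paper's actual design.

```python
def plan_scale_up(free_cpus, cpus_per_worker, cpus_per_ps, workers_per_ps=4):
    """Decide how many extra worker and ps nodes fit into the free CPUs.

    Hypothetical sketch of a Config Manager scaling decision: one ps node is
    provisioned per `workers_per_ps` new workers (an assumed ratio), and nodes
    are added greedily until the free CPU budget is exhausted.
    """
    new_workers, new_ps = 0, 0
    remaining = free_cpus
    while remaining >= cpus_per_worker:
        # A new ps node must come first whenever adding another worker
        # would exceed the assumed worker/ps ratio.
        need_ps = new_ps * workers_per_ps <= new_workers
        cost = cpus_per_worker + (cpus_per_ps if need_ps else 0)
        if remaining < cost:
            break  # not enough budget for the worker plus its required ps
        if need_ps:
            new_ps += 1
            remaining -= cpus_per_ps
        new_workers += 1
        remaining -= cpus_per_worker
    return new_workers, new_ps


# With 10 free CPUs, 2 CPUs per worker, and 1 CPU per ps, this plan
# adds 4 workers and 1 ps node (9 CPUs used, 1 left over).
print(plan_scale_up(10, cpus_per_worker=2, cpus_per_ps=1))
```

In a real deployment the free-resource figures would come from the Kubernetes API (e.g. node capacity minus pod requests), and the new nodes would be launched by updating the training job's cluster specification, steps the abstract summarizes but does not detail.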
