Abstract

Distributed deep learning is a widely used machine learning approach because of its many advantages. One of the tools used to train distributed deep learning models is Kubeflow, which runs on top of Kubernetes, a container orchestrator that eases the deployment of applications. This in turn makes distributed deep learning training on Kubeflow easier to run and manage. Prior work on dynamic resource scheduling in Kubernetes for deep learning training includes DRAGON, which proposed a scheduler with autoscaling and gang-scheduling capabilities, and OASIS, which proposed a utility system with a price function. In this work, we combine DRAGON's and OASIS's approaches to build a scheduler with weighted autoscaling that schedules its jobs using gang scheduling. We modify DRAGON's autoscaling function, increasing the frequency of scale-up calls and reducing the frequency of scale-down calls to make the training process more efficient. Weights determine the priority of each job: jobs with higher resource requirements are considered more important, and each job's weight influences the scheduler's autoscaling decisions. Experiments and evaluation on a set of TensorFlow jobs show an increase in training speed of over 26% compared with the default Kubernetes scheduler.
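To illustrate the weighting idea described above, the following is a minimal sketch, not taken from the paper: job weights are derived from requested resources (the GPU-to-CPU weighting factor and all job names are assumptions), and higher-weight jobs are considered first when the scheduler has spare capacity to scale up.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cpu_request: float  # CPUs requested per replica
    gpu_request: float  # GPUs requested per replica
    replicas: int

def job_weight(job: Job, gpu_factor: float = 8.0) -> float:
    """Hypothetical weight: total requested resources, with GPUs
    weighted more heavily than CPUs (gpu_factor is an assumption)."""
    return job.replicas * (job.cpu_request + gpu_factor * job.gpu_request)

def scale_up_order(jobs: list[Job]) -> list[Job]:
    """Offer spare cluster capacity to higher-weight jobs first."""
    return sorted(jobs, key=job_weight, reverse=True)

jobs = [
    Job("small-cnn", cpu_request=2, gpu_request=0, replicas=2),
    Job("large-lm", cpu_request=4, gpu_request=1, replicas=4),
    Job("mid-rnn", cpu_request=4, gpu_request=0, replicas=3),
]

# The job requesting the most resources is scaled up first.
print([j.name for j in scale_up_order(jobs)])
# → ['large-lm', 'mid-rnn', 'small-cnn']
```

In the actual system, such an ordering would feed the gang-scheduling step, so that a job's replicas are only added when all of them can be placed at once.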
