Abstract

Deep neural networks (DNNs) have recently yielded strong results on a range of applications. Training these DNNs on a cluster of commodity machines is a promising approach, since training is time-consuming and compute-intensive. Furthermore, running DNN tasks in containers on such clusters would enable broader and easier deployment of DNN-based algorithms. Toward this end, this paper addresses the problem of scheduling DNN tasks in a containerized cluster environment. Efficiently scheduling data-parallel computation jobs such as DNN training over containerized clusters is critical for job performance, system throughput, and resource utilization, and it becomes even more challenging with complex workloads. We propose a scheduling method called Deep Learning Task Allocation Priority (DLTAP), which makes scheduling decisions in a distributed manner; each decision takes the aggregation degree of parameter server (ps) tasks and worker tasks into account, in particular to reduce cross-node network transmission traffic and, correspondingly, to decrease DNN training time. We evaluate DLTAP using a state-of-the-art distributed DNN training framework on three benchmarks. The results show that the proposed method reduces cross-node network traffic by 12% on average and decreases DNN training time even on a cluster of low-end servers.

Highlights

  • Large-scale deep learning, for instance, the Deep Neural Network (DNN), has driven advances with higher accuracy than traditional techniques in many different fields, especially image classification [1, 2], speech recognition [3], and text processing [4]

  • Compared to DistBelief, TensorFlow is more flexible, faster, and easier to use. It introduces a new distributed programming paradigm in which a DNN job is divided into a number of parameter server (ps) tasks and worker tasks, which can be flexibly allocated across the cluster (see the sketch after this list)

  • Our work points out a network transmission issue for deep learning workloads in the containerized cluster environment
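To make the ps/worker split concrete, here is a minimal sketch in the style of TensorFlow 1.x's distributed API, in which a job is declared as a set of ps tasks and worker tasks. The host names, ports, and variable shapes are illustrative placeholders, not values from the paper.

```python
import tensorflow as tf  # TensorFlow 1.x distributed training API

# Hypothetical cluster layout: two ps tasks and two worker tasks.
# Host names and ports are placeholders only.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each task runs its own server process, identified by job name and index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter places variables (model parameters) on ps tasks
# and the computation on the local worker task.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    weights = tf.Variable(tf.zeros([784, 10]), name="weights")
    bias = tf.Variable(tf.zeros([10]), name="bias")
```

Because every worker exchanges parameters and updates with the ps tasks at each training step, where the scheduler places these tasks determines how much of that traffic must cross node boundaries.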



Introduction

Large-scale deep learning, for instance, the Deep Neural Network (DNN), has driven advances with higher accuracy than traditional techniques in many different fields, especially image classification [1, 2], speech recognition [3], and text processing [4]. In distributed DNN training frameworks such as TensorFlow, a DNN job is divided into a number of parameter server (ps) tasks and worker tasks that can be flexibly allocated across the cluster; this differs from the earlier parameter server concept [5] in that there is no need for a centralized global parameter server to store all the parameters of a job. This flexible training approach brings benefits, but it also introduces new challenges: (a) how to provision resources for tasks in a timely and elastic way; and (b) how to schedule ps and worker tasks in a distributed environment to improve job performance. Our main objective is to take the aggregation degree of ps and worker tasks into consideration, and our comparison demonstrates that doing so can greatly reduce the cross-node network transmission traffic between ps and worker tasks. In this way, we can accelerate the training speed of DNNs.
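The exact DLTAP priority function is not reproduced in this summary, so the following Python sketch only illustrates the general idea, under the assumption that "aggregation degree" counts how many tasks of the same job are already placed on a candidate node; all class and function names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch only: the scoring rule and names are assumptions,
# not the paper's exact DLTAP formulation.

@dataclass
class Task:
    job_id: str   # DNN job this task belongs to
    kind: str     # "ps" or "worker"
    cpu: float
    mem: float

@dataclass
class Node:
    name: str
    free_cpu: float
    free_mem: float
    tasks: List[Task] = field(default_factory=list)

def aggregation_degree(node: Node, pending: Task) -> int:
    """Number of tasks of the same job already placed on this node."""
    return sum(1 for t in node.tasks if t.job_id == pending.job_id)

def pick_node(nodes: List[Node], pending: Task) -> Optional[Node]:
    """Prefer the feasible node that maximizes co-location with the same job,
    keeping ps<->worker parameter exchange on-node where possible."""
    feasible = [n for n in nodes
                if n.free_cpu >= pending.cpu and n.free_mem >= pending.mem]
    if not feasible:
        return None  # no node can host the task right now
    return max(feasible, key=lambda n: aggregation_degree(n, pending))
```

Co-locating a worker with the ps tasks of its own job keeps the per-step parameter exchange on-node, which is exactly the cross-node traffic the paper aims to reduce.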

The remainder of this paper is organized as follows: Related Works, Proposed Method, Experiments Evaluation, and Summary.