Abstract
Deep neural networks have become one of the most popular techniques in many research and application areas, including computer vision and natural language processing. As the complexity of neural networks keeps increasing, the training process takes much longer and requires more computation resources. To speed up training, a centralized distributed training structure named the Parameter Server (PS) is widely used to assign training tasks to different workers/nodes. Most existing studies assume that all workers have the same computation resources. In a heterogeneous environment, however, fast workers (i.e., workers with more computation resources) complete their tasks earlier than slow workers, so the system does not fully utilize the resources of the fast workers. In this paper, we propose a PS model for heterogeneous workers/nodes, called H-PS, which fully utilizes the resources of each worker by dynamically scheduling tasks based on the workers' current status (e.g., available memory). By doing so, all workers complete their tasks at about the same time and stragglers (i.e., workers that fall behind the others) are avoided. In addition, a pipeline scheme is proposed to further improve worker effectiveness by keeping workers busy while parameters are transmitted between the PS and the workers. Moreover, a flexible quantization scheme is proposed to reduce the communication overhead between the PS and the workers. Finally, H-PS is implemented using containers, an emerging lightweight virtualization technology. Experimental results indicate that the proposed H-PS reduces the overall training time by 1.4x to 3.5x compared with existing methods.
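To make the pipeline idea concrete, the following is a minimal sketch of overlapping parameter exchange with local computation, so a worker keeps training while parameters travel between it and the PS. The function names (push_pull, compute_minibatch) and the sleep-based timings are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: keep the worker computing while parameters are in flight.
import threading
import time

def push_pull(params):
    """Stand-in for sending gradients to the PS and receiving fresh parameters."""
    time.sleep(0.05)            # simulated network latency
    return [p for p in params]

def compute_minibatch(params):
    """Stand-in for forward/backward propagation on one mini-batch."""
    time.sleep(0.02)
    return [0.01 for _ in params]   # dummy gradients

params = [0.0] * 4
for step in range(5):
    grads = compute_minibatch(params)

    # Start the communication in the background ...
    result = {}
    t = threading.Thread(target=lambda: result.update(p=push_pull(params)))
    t.start()

    # ... and keep the worker busy on another mini-batch (using slightly
    # stale parameters) instead of idling during the transfer.
    extra_grads = compute_minibatch(params)

    t.join()
    params = result["p"]
```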
Highlights
In recent years, deep neural networks have become one of the most popular techniques and have been successfully applied in many research and application fields, including computer vision, natural language processing, systems management, and the Internet of Things (IoT) [1], [2]
We propose a heterogeneous-aware parameter server model, which focuses on speeding up the training process of deep neural networks in a heterogeneous environment
To improve the training performance of the Parameter Server (PS) system, the proposed scheme is designed from three aspects: 1) Dynamically allocate workloads according to the current computing capacities of the workers; 2) Keep workers training during parameter communication to fully utilize the system resources; and 3) Apply flexible parameter quantization according to the change in accuracy during training to reduce the total amount of communication data
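As an illustration of the third aspect, the sketch below adapts the quantization bit width to how fast accuracy is changing: fewer bits when accuracy is still improving quickly, more bits when it plateaus. The thresholds and the choose_bits policy are assumptions made for this example, not the exact scheme from the paper.

```python
# Sketch: uniform quantization with an accuracy-driven bit width.
import numpy as np

def quantize(x, bits):
    """Uniformly quantize x to the given bit width; return (codes, offset, scale)."""
    levels = 2 ** bits - 1
    scale = (x.max() - x.min()) / max(levels, 1)
    codes = np.round((x - x.min()) / (scale + 1e-12)).astype(np.uint32)
    return codes, x.min(), scale

def dequantize(codes, offset, scale):
    return codes.astype(np.float32) * scale + offset

def choose_bits(acc_delta, current_bits):
    """Use fewer bits while accuracy improves quickly, more when it stagnates."""
    if acc_delta > 0.01:
        return max(current_bits - 1, 2)
    return min(current_bits + 2, 16)

grads = np.random.randn(1000).astype(np.float32)
bits = 8
codes, offset, scale = quantize(grads, bits)
restored = dequantize(codes, offset, scale)
print("bits:", bits, "max error:", float(np.abs(grads - restored).max()))
bits = choose_bits(acc_delta=0.002, current_bits=bits)   # accuracy plateaued -> raise precision
```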
Summary
Deep neural networks have become one of the most popular techniques and have been successfully applied in many research and application fields, including computer vision, natural language processing, systems management, and the Internet of Things (IoT) [1], [2]. In the Parameter Server architecture, the workers focus on training tasks such as forward and backward propagation. Distributed systems such as Spark [8], GraphX [9], and MLlib [10] assume that all machines are identical (i.e., they have the same configuration, including the same amount of memory); in other words, they train neural networks in a homogeneous environment. In practice, however, the systems may have different hardware configurations, such as different numbers of CPUs, varying memory sizes, and dynamically changing network bandwidths. If such systems/workers are assigned the same workloads, the workers with more available resources (denoted as fast workers) complete their tasks faster than those with fewer available resources (denoted as slow workers).
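A minimal sketch of the capacity-proportional allocation this motivates is shown below: mini-batch shares are assigned in proportion to each worker's measured capacity so that fast and slow workers finish at roughly the same time. The capacity values and worker names are made up for illustration and are not taken from the paper.

```python
# Sketch: split a global batch across workers in proportion to capacity.
def allocate(total_batch, capacities):
    total = sum(capacities.values())
    shares = {w: int(total_batch * c / total) for w, c in capacities.items()}
    # Give any rounding remainder to the fastest worker.
    leftover = total_batch - sum(shares.values())
    fastest = max(capacities, key=capacities.get)
    shares[fastest] += leftover
    return shares

capacities = {"worker-0": 4.0, "worker-1": 2.0, "worker-2": 1.0}  # relative speeds
print(allocate(700, capacities))  # {'worker-0': 400, 'worker-1': 200, 'worker-2': 100}
```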