Abstract

Traditional deep neural network (DNN) training is executed with data parallelism, which suffers from significant communication overhead and GPU memory consumption. In light of this, recent pioneering works have attempted to train DNNs with model parallelism. However, model partitioning remains a major concern, and a static partition fails to adapt to the ever-changing computation environment of a cloud cluster. This paper proposes ElasticPipe, which trains neural networks with pipeline-based model parallelism. Unlike data-parallel solutions, each node in ElasticPipe holds only part of the whole model, leading to much lower communication and GPU memory costs. More importantly, ElasticPipe can dynamically tune the workload distribution among nodes, mitigating the common straggler effect in cloud environments. Our preliminary experiments show that, compared to data-parallel baselines, ElasticPipe reduces training time by up to 89.03% without the straggler effect, and by up to 76.72% in the presence of stragglers. Moreover, ElasticPipe outperforms its static counterpart by up to 28.81% in training performance when stragglers are involved.
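The abstract's key idea, assigning each node a slice of the model's layers and re-tuning that assignment as node speeds change, can be illustrated with a minimal, framework-free sketch. The function below is hypothetical (the abstract does not give the actual algorithm): it simply assigns contiguous layer ranges to workers in proportion to each worker's measured speed, so a straggler automatically receives a smaller share.

```python
def partition_layers(num_layers, speeds):
    """Assign a contiguous block of layers to each worker, proportional
    to its measured speed. Hypothetical helper, not ElasticPipe's
    actual partitioning algorithm."""
    total = sum(speeds)
    # Ideal fractional share of layers per worker.
    shares = [num_layers * s / total for s in speeds]
    counts = [int(x) for x in shares]
    # Hand leftover layers to workers with the largest fractional remainder.
    leftover = num_layers - sum(counts)
    order = sorted(range(len(speeds)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    # Convert per-worker counts into (start, end) layer ranges.
    ranges, start = [], 0
    for c in counts:
        ranges.append((start, start + c))
        start += c
    return ranges

# Balanced cluster: 4 equal-speed workers share 16 layers evenly.
print(partition_layers(16, [1.0, 1.0, 1.0, 1.0]))
# Straggler: worker 2 runs at half speed and receives fewer layers.
print(partition_layers(16, [1.0, 1.0, 0.5, 1.0]))
```

Re-running such a partitioner whenever measured speeds drift is one simple way a pipeline-parallel system could rebalance work away from stragglers at runtime.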
