Abstract

In distributed training, workers collaboratively refine the global model parameters by pushing their updates to the Parameter Server and pulling fresher parameters for the next iteration. This introduces high communication costs for training at scale and incurs unproductive waiting time for workers. To minimize the waiting time, existing approaches overlap communication and computation for deep neural networks. Yet these techniques not only require layer-by-layer model structures, but also demand significant effort in runtime profiling and hyperparameter tuning. To make the overlapping optimization simple and generic, in this article we propose a new Parameter Server framework. Our solution decouples the dependency between push and pull operations and allows workers to eagerly pull the global parameters. This way, both push and pull operations can be easily overlapped with computations. Moreover, the overlapping design offers a different way to address the straggler problem, in which stale updates greatly retard the training process. In the new framework, workers have adequate information to explicitly modulate the learning rates of their updates, so the global parameters are less compromised by stale updates. We implement a prototype system in PyTorch and demonstrate its effectiveness on both CPU and GPU clusters. Experimental results show that our prototype reduces per-iteration time by up to 54% and requires up to 37% fewer iterations for model convergence, achieving up to 2.86× speedup over widely used synchronization schemes.
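To illustrate the idea of eagerly pulling global parameters while local computation proceeds, and of damping stale updates via the learning rate, the following is a minimal, self-contained toy sketch. It is not the paper's implementation: the ToyParameterServer class, the worker loop, and the 1/(1 + staleness) damping rule are illustrative assumptions, and the Python threads here only model the overlap pattern that a real system would realize with non-blocking communication.

```python
# Toy sketch of eager pull + staleness-damped push (illustrative only).
import threading
import torch

class ToyParameterServer:
    """In-memory stand-in for a Parameter Server shared by all workers."""
    def __init__(self, dim):
        self.params = torch.zeros(dim)
        self.step = 0
        self.lock = threading.Lock()

    def pull(self):
        # Return a snapshot of the global parameters and their version.
        with self.lock:
            return self.params.clone(), self.step

    def push(self, update, lr):
        # Apply a worker's update with the learning rate it chose.
        with self.lock:
            self.params -= lr * update
            self.step += 1

def worker_loop(ps, data, target, iters, base_lr):
    params, version = ps.pull()
    for _ in range(iters):
        # Eagerly pull fresher parameters in the background,
        # overlapping the pull with the local gradient computation.
        box = {}
        puller = threading.Thread(
            target=lambda: box.update(zip(("params", "version"), ps.pull())))
        puller.start()

        # Local computation: gradient of a least-squares loss on this shard.
        pred = data @ params
        grad = data.t() @ (pred - target) / len(target)

        puller.join()
        # Damp the update according to how stale its base parameters are
        # (one possible modulation rule, assumed here for illustration).
        staleness = ps.step - version
        ps.push(grad, base_lr / (1.0 + staleness))

        params, version = box["params"], box["version"]

if __name__ == "__main__":
    torch.manual_seed(0)
    dim, n = 5, 64
    true_w = torch.randn(dim)
    ps = ToyParameterServer(dim)
    threads = []
    for _ in range(4):  # four workers, each with its own data shard
        X = torch.randn(n, dim)
        y = X @ true_w
        t = threading.Thread(target=worker_loop, args=(ps, X, y, 50, 0.1))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    print("distance to true weights:", torch.norm(ps.params - true_w).item())
```

Because the pull runs concurrently with the gradient computation, a worker never blocks waiting for fresh parameters before it can start computing, and the version it pulled lets it judge how stale its own update is before pushing.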
