Abstract

Asynchronous stochastic gradient descent (ASGD) is a computationally efficient algorithm that speeds up deep learning training and plays an important role in distributed deep learning. However, ASGD suffers from the stale gradient problem: the gradient computed by a worker may no longer match the current weights on the parameter server. This problem seriously degrades model performance and can even cause divergence. To address this issue, this paper designs a dynamic adjustment scheme based on the momentum algorithm that combines a stale penalty with stale compensation: the penalty reduces the trust placed in a stale gradient, while the compensation offsets the harm that the stale gradient causes. Building on this scheme, this paper proposes a dynamic asynchronous stochastic gradient descent algorithm (DASGD), which dynamically adjusts the compensation factor and the penalty factor according to the staleness. Moreover, we prove that DASGD converges under some mild assumptions. Finally, we build a real distributed training cluster to evaluate DASGD on the Cifar10 and ImageNet datasets. Compared with four SOTA baselines, the experimental results confirm the superior performance of DASGD. More specifically, DASGD reaches nearly the same test accuracy as SGD on Cifar10 and ImageNet while using only around 27.6% and 40.8% of SGD's training time, respectively.
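The abstract does not give the exact update rule, but the general idea of down-weighting a delayed gradient with a staleness-dependent penalty and adding a momentum-based compensation term can be sketched as follows. This is a minimal illustration, not the authors' implementation: the penalty form `1 / (1 + alpha * staleness)`, the use of the server-side momentum buffer as the compensation term, and the constants `alpha` and `beta` are assumptions chosen for the example.

```python
import numpy as np

def staleness_aware_update(weights, stale_grad, momentum, staleness,
                           lr=0.1, alpha=0.5, beta=0.9):
    """Illustrative staleness-aware update (hypothetical form, not the paper's rule).

    weights    : current parameter-server weights
    stale_grad : gradient a worker computed against older weights
    momentum   : running momentum buffer kept on the server
    staleness  : number of global updates since the worker pulled its weights
    """
    # Stale penalty: trust the gradient less as its staleness grows
    # (the 1 / (1 + alpha * staleness) form is assumed for illustration).
    penalty = 1.0 / (1.0 + alpha * staleness)

    # Stale compensation: reuse the server-side momentum as a stand-in for the
    # updates the worker missed while it was computing its gradient.
    compensation = beta * momentum

    # Blend the penalized stale gradient with the compensation term.
    effective_grad = penalty * stale_grad + (1.0 - penalty) * compensation

    # Standard momentum bookkeeping and weight update.
    momentum = beta * momentum + (1.0 - beta) * effective_grad
    weights = weights - lr * effective_grad
    return weights, momentum


# Toy usage: a worker's gradient arrives 3 global steps late.
w = np.zeros(4)
m = np.zeros(4)
g = np.array([0.2, -0.1, 0.05, 0.3])
w, m = staleness_aware_update(w, g, m, staleness=3)
```

In this sketch, a gradient with staleness 0 is applied at full strength, while a very stale gradient is mostly replaced by the momentum-based compensation; DASGD's contribution, per the abstract, is adjusting both factors dynamically from the observed staleness.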
