Abstract
In distributed deep learning with data parallelism, the communication bottleneck throttles the efficiency of model training. Recent studies adopt various gradient compression techniques, among which gradient sparsification stands out as an effective way to reduce the number of gradients to be transmitted. However, the deployment of gradient sparsification is adversely affected by changes in the network environment in real systems, and existing methods either neglect bandwidth dynamics during training or suffer from drastic fluctuations in compression ratio. In this paper, we propose ACE, a novel adaptive gradient compression mechanism that achieves high communication efficiency under bandwidth variation. ACE adapts the sparsification ratio to the average bandwidth over a time window, rather than following the bandwidth dynamics exactly. To accurately compute the compression ratio, we first profile the compression time and model the time of a single iteration, which consists of communication, computation, and compression operations. We then present a practical model to fit the number of training rounds needed until convergence, and formulate an optimization problem to compute the optimal sparsification ratio. We conduct experiments on different DNN models in different network environments and compare various methods in terms of convergence and model quality. The experimental results show that ACE achieves up to 9.39× and 1.28× training speedups over fixed and state-of-the-art adaptive compression methods, respectively.
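The ingredients named in the abstract (sparsification, a windowed bandwidth average, a per-iteration time model, and a ratio chosen by optimization) can be illustrated with a minimal sketch. This is not the ACE implementation: every function name, the additive cost model, the 8-bytes-per-entry assumption (a 4-byte value plus a 4-byte index), and the rounds-to-convergence function are hypothetical placeholders chosen only to make the idea concrete.

```python
from collections import deque


def top_k_sparsify(grad, ratio):
    """Keep the largest-magnitude fraction `ratio` of entries (top-k sparsification)."""
    k = max(1, int(len(grad) * ratio))
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    idx.sort()
    return idx, [grad[i] for i in idx]


class WindowedBandwidth:
    """Average of the last `window` bandwidth samples (bytes per second)."""

    def __init__(self, window):
        self.samples = deque(maxlen=window)

    def update(self, bandwidth):
        self.samples.append(bandwidth)
        return sum(self.samples) / len(self.samples)


def iteration_time(ratio, n_params, bandwidth, t_compute, t_compress,
                   bytes_per_entry=8):
    """Assumed additive model: computation + compression + communication."""
    t_comm = ratio * n_params * bytes_per_entry / bandwidth
    return t_compute + t_compress + t_comm


def pick_ratio(candidates, rounds_fn, n_params, bandwidth, t_compute, t_compress):
    """Choose the candidate ratio that minimizes estimated total training time.

    `rounds_fn(ratio)` is a user-supplied (hypothetical) estimate of the number
    of rounds needed to converge at that ratio.
    """
    return min(
        candidates,
        key=lambda r: rounds_fn(r)
        * iteration_time(r, n_params, bandwidth, t_compute, t_compress),
    )


if __name__ == "__main__":
    monitor = WindowedBandwidth(window=3)
    for sample in (1e6, 2e6, 3e6):
        avg_bw = monitor.update(sample)
    # Purely illustrative convergence model: sparser gradients need more rounds.
    rounds = lambda r: 1000 * (1 + 0.01 / r)
    ratio = pick_ratio([0.001, 0.01, 0.1], rounds, 1_000_000, avg_bw, 0.1, 0.02)
    print(ratio, top_k_sparsify([0.1, -3.0, 0.5, 2.0], 0.5))
```

Feeding the windowed average rather than each raw sample into `pick_ratio` is what keeps the chosen ratio from swinging with every short-lived bandwidth spike, which is the behavior the abstract contrasts with methods that track bandwidth exactly.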