Abstract

Communication scheduling is crucial for accelerating the training of large deep learning models: it determines the transmission order of layer-wise deep neural network (DNN) tensors to achieve better computation-communication overlap. Prior approaches adopt user-level tensor partitioning to enable priority scheduling at finer granularity. However, the startup time slot inserted before each tensor partition can neutralize this scheduling gain, and tuning the partitioning hyper-parameters is difficult, especially when network bandwidth is shared or time-varying in multi-tenant clusters. In this paper, we propose Mercury, a simple transport-layer scheduler that moves priority scheduling into the transport layer at packet granularity: the packets with the highest priority in the Mercury buffer are transmitted first. Mercury achieves near-optimal overlap between communication and computation. It also leverages immediate aggregation at the transport layer to fully overlap gradient push and pull. We implement Mercury in MXNet and conduct comprehensive experiments on five popular DNN models in various environments. Mercury adapts well to dynamic communication and computation resources. Experiments show that Mercury accelerates training by up to 130% compared to the classical PS architecture and by up to 104% compared to state-of-the-art tensor partitioning methods.
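As a rough illustration of the packet-level priority idea described above (a minimal sketch, not the authors' implementation), the snippet below models a hypothetical transport-layer buffer in which each enqueued packet carries the priority of its layer-wise tensor, and the most urgent packet is always dequeued for transmission first. The class and packet payloads are invented for illustration only.

```python
import heapq
import itertools


class PriorityPacketBuffer:
    """Hypothetical transport-layer buffer: packets carrying
    higher-priority (lower layer-index) tensor data are transmitted first."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # FIFO tie-break among equal priorities

    def enqueue(self, priority, packet):
        # Smaller priority value = earlier layer = needed sooner
        # by the next iteration's forward pass.
        heapq.heappush(self._heap, (priority, next(self._seq), packet))

    def dequeue(self):
        # Pop the most urgent packet; return None when the buffer is empty.
        if not self._heap:
            return None
        _, _, packet = heapq.heappop(self._heap)
        return packet


# Example: gradient packets of layer 0 preempt those of layer 2,
# even though they were enqueued later.
buf = PriorityPacketBuffer()
buf.enqueue(2, b"grad-chunk-layer2")
buf.enqueue(0, b"grad-chunk-layer0")
assert buf.dequeue() == b"grad-chunk-layer0"
```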
