Abstract

Communication scheduling is crucial for efficient training of large deep learning models with data parallelism: the transmission order of layer-wise deep neural network (DNN) tensors is chosen to better overlap computation and communication. Prior approaches adopt tensor partitioning to enhance priority scheduling with finer granularity. However, the startup time slot inserted before each tensor partition neutralizes this scheduling gain; tuning the optimal partition size is difficult, and application-layer solutions cannot eliminate the partitioning overhead. In this paper, we propose Mercury, a simple transport-layer scheduler that does not partition tensors but instead moves priority scheduling to the transport layer at packet granularity. The packets with the highest priority in the Mercury buffer are transmitted first, so Mercury achieves near-optimal overlap between communication and computation. It further leverages immediate aggregation at the transport layer to enable coincident gradient push and parameter pull. We implement Mercury in MXNet and conduct comprehensive experiments on five DNN models in an 8-node cluster with 10Gbps Ethernet. Experimental results show that Mercury achieves about 1.18–2.18× speedup over vanilla MXNet and 1.08–2.04× speedup over the state-of-the-art tensor partitioning solution.
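To make the core idea concrete, the sketch below illustrates packet-granularity priority scheduling as the abstract describes it: whole tensors are enqueued without application-level partitioning, and the buffer always releases the highest-priority packet first. This is only an illustrative sketch under stated assumptions, not the authors' transport-layer implementation; the names Packet, MercuryBuffer, and MTU are hypothetical, and the priority convention (lower value = layer needed sooner) is assumed.

```python
# Minimal sketch (not the authors' code) of packet-level priority scheduling.
# Assumptions: lower priority value = layer needed earlier; MTU-sized payloads.
import heapq
import itertools

MTU = 1500  # assumed per-packet payload size in bytes (hypothetical)

class Packet:
    def __init__(self, priority, layer, payload):
        self.priority = priority  # lower value = more urgent layer
        self.layer = layer
        self.payload = payload

class MercuryBuffer:
    """Priority buffer where packets, not tensor partitions, are the scheduling unit."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # FIFO tie-break within a priority

    def enqueue_tensor(self, priority, layer, data: bytes):
        # Packetize the whole tensor; no application-level partitioning,
        # hence no per-partition startup slot.
        for off in range(0, len(data), MTU):
            pkt = Packet(priority, layer, data[off:off + MTU])
            heapq.heappush(self._heap, (priority, next(self._seq), pkt))

    def dequeue(self):
        # The highest-priority packet (e.g., the layer needed first in the
        # next forward pass) is transmitted first.
        if not self._heap:
            return None
        _, _, pkt = heapq.heappop(self._heap)
        return pkt
```

A usage pattern under these assumptions: each worker enqueues gradient tensors as they are produced during back-propagation, while the sender thread repeatedly calls dequeue(), so packets of early layers preempt those of later layers without any partition-size tuning.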
