Abstract

It has become common practice to train large machine learning (ML) models across a cluster of computing nodes connected by RDMA-enabled networks. However, the communication overhead caused by parameter synchronization degrades the performance of such distributed ML (DML), especially at large scale. This paper tackles the issue with a traffic management scheme for DML traffic, called TMDML (Traffic Management for DML), which requires only a minor modification to DCQCN, the existing RDMA congestion control scheme. We assume that only one DML workload instance runs in the network. Existing literature has shown that Fat-Tree, the predominant data center topology, supports DML poorly compared with BCube; with TMDML, however, DML training on Fat-Tree can outperform training on BCube. We first study the impact of multiple bottlenecks on DML via NS-3-based simulations. The results show that DCQCN is inefficient for DML traffic in multi-bottleneck scenarios. To mitigate this impact, we formulate an optimization model that minimizes the maximum flow completion time (FCT) while stabilizing the queues, and apply the Lyapunov optimization technique to solve it. For practical deployment, we present two heuristic implementations of TMDML that target different deployment requirements. We evaluate our proposals by simulation against DCQCN, using All-Reduce parameter synchronization in Fat-Tree and BCube with traffic traces of modern deep neural network models, including AlexNet, ResNet50, and VGG-16. Our proposals achieve a time reduction of up to 59% compared with DCQCN.
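As background on the solution technique named above, the following is a minimal sketch of the standard Lyapunov drift-plus-penalty objective; the symbols used here (queue backlogs Q_i(t), quadratic Lyapunov function L, penalty term p(t) standing in for the max-FCT objective, and trade-off weight V) are illustrative assumptions rather than the paper's exact formulation or notation:

\[
\min \;\; \Delta(\mathbf{Q}(t)) + V\,\mathbb{E}\{\,p(t) \mid \mathbf{Q}(t)\,\},
\qquad
\Delta(\mathbf{Q}(t)) \triangleq \mathbb{E}\{\,L(\mathbf{Q}(t+1)) - L(\mathbf{Q}(t)) \mid \mathbf{Q}(t)\,\},
\qquad
L(\mathbf{Q}(t)) = \tfrac{1}{2}\sum_{i} Q_i(t)^2
\]

Greedily minimizing this bound in each time slot keeps the queues stable while driving the penalty term, here a proxy for the maximum FCT, toward its optimum; the weight V controls the trade-off between queue backlog and the FCT objective.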
