Abstract

Training jobs of deep neural networks (DNNs) can be accelerated through distributed variants of stochastic gradient descent (SGD), where multiple nodes process training examples and exchange updates. The total throughput of the nodes depends not only on their computing power, but also on their networking speeds and coordination mechanism (synchronous or asynchronous, centralized or decentralized), since communication bottlenecks and stragglers can result in sublinear scaling when additional nodes are provisioned. In this paper, we propose two classes of performance models to predict throughput of distributed SGD: fine-grained models, representing many elementary computation/communication operations and their dependencies; and coarse-grained models, where SGD steps at each node are represented as a sequence of high-level phases without parallelism between computation and communication. Using a PyTorch implementation, real-world DNN models and different cloud environments, our experimental evaluation illustrates that, while fine-grained models are more accurate and can be easily adapted to new variants of distributed SGD, coarse-grained models can provide similarly accurate predictions when augmented with ad hoc heuristics, and their parameters can be estimated with profiling information that is easier to collect.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call