Abstract
Distributed learning with multiple GPUs has been widely adopted to accelerate the training of large-scale deep neural networks. However, misconfiguration of GPU clusters with various communication primitives and topologies can diminish the gains of parallel computation and lead to significant degradation in training efficiency. Predicting the performance of distributed learning enables service providers to identify potential bottlenecks beforehand. In this work, we propose a Performance prediction framework over General Topologies, called PerfTop, for accurate estimation of per-iteration execution time. The main strategy is to integrate computation time prediction with an analytical model that captures the nonlinearity in communication and the fine-grained computation-communication patterns. This enables accurate prediction for a variety of neural network models over general topologies, such as tree, hierarchical, and exponential. Our extensive experiments show that PerfTop outperforms existing methods in estimating both computation and communication time, surpassing them by over 45% on communication time in particular. Meanwhile, it achieves above 85% accuracy in predicting execution time over general topologies, rather than only the simple topologies such as star and ring considered in prior works.