Abstract

Distributed Machine Learning (DML) is proposed not only to accelerate the training of machine learning models, but also to handle training datasets too large for a single machine. It employs multiple computing nodes in a data center that work collaboratively in parallel, at the cost of high communication overhead. Gradient Compression (GC) reduces this communication overhead by reducing the number of gradients synchronized among computing nodes. However, existing GC solutions suffer under varying network congestion: when some computing nodes experience high congestion, their gradient transmission can be significantly delayed, slowing down the entire training process. To solve this problem, we propose FLASH, a congestion-aware GC solution for DML. FLASH accelerates training by jointly considering the iterative approximation nature of machine learning and dynamic network congestion. It maintains good training performance by adaptively adjusting and scheduling the number of gradients synchronized among computing nodes. We evaluate the effectiveness of FLASH using AlexNet and ResNet18 under different network congestion scenarios. Simulation results show that, for the same number of training epochs, FLASH reduces training time by 22-71% while maintaining good accuracy and low loss, compared with the existing memory top-K GC solution.
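For readers unfamiliar with the baseline the abstract compares against, below is a minimal sketch of memory (error-feedback) top-K gradient compression: each worker sends only the k largest-magnitude gradient coordinates per iteration and accumulates the dropped remainder locally for later rounds. The function name `topk_compress_with_memory` and the `k_ratio` default are illustrative assumptions, not definitions from the paper, and FLASH's congestion-aware adjustment of the synchronized gradient count is not shown here.

```python
import numpy as np

def topk_compress_with_memory(grad, memory, k_ratio=0.01):
    """Memory (error-feedback) top-K gradient compression sketch.

    grad:    flat gradient vector computed by one worker this iteration
    memory:  residual of gradients dropped in previous iterations
    k_ratio: fraction of coordinates to synchronize (illustrative value)
    """
    # Add back the residual so dropped gradients are not lost permanently.
    corrected = grad + memory

    # Keep only the k largest-magnitude coordinates.
    k = max(1, int(k_ratio * corrected.size))
    idx = np.argpartition(np.abs(corrected), -k)[-k:]

    values = corrected[idx]        # sparse payload synchronized with other nodes
    new_memory = corrected.copy()
    new_memory[idx] = 0.0          # dropped remainder stays in local memory
    return idx, values, new_memory


# Usage: compress a random gradient, synchronizing 1% of its coordinates.
rng = np.random.default_rng(0)
g = rng.standard_normal(10_000)
mem = np.zeros_like(g)
idx, vals, mem = topk_compress_with_memory(g, mem, k_ratio=0.01)
print(f"sent {idx.size} of {g.size} gradient coordinates")
```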
