It is common practice to speed up machine learning (ML) training by distributing it across a cluster of computing nodes. Data-parallel distributed ML (DML) training relieves the computational pressure on each node; however, the communication traffic introduced during parameter synchronization becomes the bottleneck of DML training. We identify two primary causes of this bottleneck: high contention among concurrent communications and a large volume of redundant transmission in the push and pull stages of parameter synchronization. To address these issues, we propose a novel Group Stale Synchronous Parallel (GSSP) scheme, which divides the nodes into groups and coordinates the groups to synchronize in a circular order. GSSP mitigates network contention and is proven to converge. We also analyze the optimal number of groups based on bandwidth and buffer size. To reduce traffic redundancy, we propose a multicast-based scheme that generates multicast trees by minimizing link overlap and allocates transmission rates to multicast flows by solving a min-max optimization problem. Finally, we conduct extensive simulations to evaluate the performance of our proposals. We simulate parameter transmission for All-Reduce and parameter-server architectures on a Fat-Tree topology with traffic traces of ML models. Simulation results show that our proposals make DML training communication-efficient by mitigating contention and reducing redundancy.
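The sketch below is a minimal, hypothetical illustration (not the paper's implementation) of the group-based synchronization idea described in the abstract: workers are partitioned into groups, and the groups take turns pushing gradients to and pulling parameters from a central server in a fixed circular order, so that only one group's synchronization traffic contends for the network at a time. All names (`ParameterServer`, `Worker`, `gssp_round`, `num_groups`) are assumptions introduced for illustration only.

```python
# Minimal sketch of circular, group-wise parameter synchronization.
# Hypothetical code: class and function names are not from the paper.
import numpy as np


class ParameterServer:
    def __init__(self, dim):
        self.params = np.zeros(dim)

    def push(self, grad, lr=0.1):
        # Apply one worker's gradient (push stage).
        self.params -= lr * grad

    def pull(self):
        # Return the latest global parameters (pull stage).
        return self.params.copy()


class Worker:
    def __init__(self, dim, rng):
        self.local_params = np.zeros(dim)
        self.rng = rng

    def compute_gradient(self):
        # Stand-in for a real mini-batch gradient.
        return self.rng.normal(size=self.local_params.shape)


def gssp_round(server, groups):
    """One round: groups synchronize one after another in circular order."""
    for group in groups:            # only one group communicates at a time
        for worker in group:
            server.push(worker.compute_gradient())
        fresh = server.pull()
        for worker in group:
            worker.local_params = fresh


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, num_workers, num_groups = 8, 6, 3
    workers = [Worker(dim, rng) for _ in range(num_workers)]
    # Round-robin partition of workers into groups (assumed grouping policy).
    groups = [workers[g::num_groups] for g in range(num_groups)]
    server = ParameterServer(dim)
    for _ in range(5):
        gssp_round(server, groups)
    print(server.params)
```

In this toy version the groups are served strictly in sequence; the abstract's GSSP additionally bounds staleness between groups and chooses the number of groups based on bandwidth and buffer size, which this sketch does not model.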