Abstract

The increasing scale of data creates a growing need for high-performance distributed machine learning systems, and many research efforts aim to improve their performance. In this paper, we call upon the community to rethink transport-layer solutions for distributed machine learning, given its stringent network requirements and special algorithmic properties. Distributed machine learning jobs generate bursty traffic when synchronizing parameters, and a single long-tail flow can significantly slow down the entire training process. Meanwhile, in contrast to other distributed applications, we find that machine learning algorithms are bounded-loss tolerant: randomized network data losses below a certain fraction (typically 10%-35%) do little harm to end-to-end job performance. Motivated by this observation, we highlight new opportunities to design bounded-loss-tolerant transport that optimizes the performance of distributed machine learning. By intentionally ignoring some packet losses, we can avoid unnecessary retransmissions and thus reduce the tail flow completion time. Following this principle, our preliminary results show that a simplified protocol yields a 1.1-2.2x speedup on different distributed machine learning tasks.
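To make the bounded-loss principle concrete, below is a minimal receiver-side sketch, not the authors' protocol. It assumes a parameter-update flow split into fixed-size chunks; the names `BoundedLossReceiver`, `LOSS_BOUND`, and the zero-fill strategy are hypothetical illustrations of the idea that missing chunks are ignored rather than retransmitted, as long as the loss fraction stays within a tolerable bound.

```python
# A minimal sketch of bounded-loss-tolerant delivery (assumed design, not
# the paper's implementation). The receiver of a parameter-synchronization
# flow delivers the update even with missing chunks, provided the loss
# fraction stays under a configured bound (the abstract reports that
# roughly 10%-35% randomized loss is tolerable for training).

from dataclasses import dataclass, field

LOSS_BOUND = 0.10  # assumed tolerable loss fraction; tuned per workload


@dataclass
class BoundedLossReceiver:
    """Collects chunks of one parameter-update flow and decides when to
    stop waiting: tolerable losses are zero-filled instead of triggering
    tail-latency-inducing retransmissions."""
    total_chunks: int
    received: dict = field(default_factory=dict)  # seq -> payload

    def on_chunk(self, seq: int, payload: bytes) -> None:
        self.received[seq] = payload

    def loss_fraction(self) -> float:
        return 1.0 - len(self.received) / self.total_chunks

    def can_deliver(self) -> bool:
        # Deliver without retransmission if losses are within the bound.
        return self.loss_fraction() <= LOSS_BOUND

    def assemble(self, chunk_size: int) -> bytes:
        # Zero-fill missing chunks: the training algorithm then sees the
        # corresponding gradient entries as zero (a no-op update).
        return b"".join(
            self.received.get(i, b"\x00" * chunk_size)
            for i in range(self.total_chunks)
        )


# Usage: feed chunks as they arrive, then deliver early instead of
# waiting on retransmissions for the tail of the flow.
rx = BoundedLossReceiver(total_chunks=100)
for seq in range(95):  # 5% of chunks lost in transit
    rx.on_chunk(seq, b"\x01" * 1024)
if rx.can_deliver():
    update = rx.assemble(chunk_size=1024)
```

In a full system the bound would be negotiated per job and the sender notified to stop retransmitting once the receiver signals delivery; the zero-fill here stands in for the algorithm's tolerance of missing gradient entries.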
