A Scalable, High-Performance, and Fault-Tolerant Network Architecture for Distributed Machine Learning

Songtao Wang, Yanshu Wang, Shutao Xia, Shuai Wang, Yang Cheng, Dan Li, Jianping Wu, Jinkun Geng

doi:10.1109/tnet.2020.2999377

Abstract

In large-scale distributed machine learning (DML), network performance between machines significantly affects the speed of iterative training. In this paper we propose BML, a scalable, high-performance, and fault-tolerant DML network architecture built on Ethernet and commodity devices. BML builds on the BCube topology and runs a fully distributed gradient synchronization algorithm. Compared to a Fat-Tree network of the same size, a BML network is expected to take much less time for gradient synchronization, owing both to its lower theoretical synchronization time and to the benefit it brings to RDMA transport. Under server or link failures, the performance of BML degrades gracefully. Experiments with the MNIST and VGG-19 benchmarks on a testbed of 9 dual-GPU servers show that BML reduces the job completion time of DML training by up to 56.4% compared with a Fat-Tree network running a state-of-the-art gradient synchronization algorithm.
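
To make the topology mentioned in the abstract more concrete, the following is a minimal sketch of server addressing in a BCube-style network, assuming the standard BCube construction: each server is identified by a string of base-n digits, and two servers attach to the same switch at a given level when their addresses differ only in that digit. The function names and the parameters `n` and `levels` are illustrative assumptions, not taken from the paper, and the sketch does not implement BML's gradient synchronization algorithm. The abstract does not state the testbed parameters; n = 3 with two levels is used here only because it yields 9 servers.

```python
from itertools import product


def bcube_servers(n, levels):
    """Return all server addresses: tuples of `levels` digits, each in 0..n-1."""
    return list(product(range(n), repeat=levels))


def bcube_neighbors(addr, n):
    """Map each level to the servers sharing that level's switch with `addr`.

    Two servers share a level-i switch when their addresses differ
    only in digit i (assumed standard BCube wiring).
    """
    neighbors = {}
    for level in range(len(addr)):
        neighbors[level] = [
            addr[:level] + (digit,) + addr[level + 1:]
            for digit in range(n)
            if digit != addr[level]
        ]
    return neighbors


if __name__ == "__main__":
    servers = bcube_servers(3, 2)  # hypothetical parameters giving 9 servers
    print(len(servers))
    print(bcube_neighbors((0, 0), 3))
```

Under these assumptions each server has one port per level and reaches n - 1 peers through each switch, which is the property that lets traffic be spread across several disjoint one-switch hops instead of funneling through a core layer.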

Journal: IEEE/ACM Transactions on Networking
Publication Date: Aug 1, 2020
Citations: 27
