Abstract

Because most machine learning objective functions are additive, training can be distributed across multiple machines. Distributed machine learning is an efficient way to cope with the rapid growth of data volume, at the cost of extra inter-machine communication. One common implementation is the parameter server system, which contains two types of nodes: worker nodes, which compute updates, and server nodes, which maintain parameters. We observe that inefficient communication between workers and servers can slow down the system. We therefore formulate a graph partition problem that partitions data among workers and parameters among servers so that the total training time is minimized; this problem is NP-complete. We investigate a two-step heuristic that first partitions the data and then partitions the parameters, and we analyze the trade-off between the time spent partitioning and the resulting savings in training time. In addition, we adapt a multilevel graph partition approach to the bipartite graph partitioning. We implement both approaches on an open-source parameter server platform, PS-lite. Experimental results on synthetic and real-world datasets show that both approaches can improve communication efficiency by up to 14 times compared with random partitioning.
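
To make the two-step idea concrete, the following is a minimal, illustrative sketch (not the paper's actual algorithm): it models each sample as the set of parameters it touches, greedily assigns samples to workers so that each worker touches fewer distinct parameters, and then places parameters on servers with a greedy load-balancing rule. All names, the cost proxy, and the greedy rules are assumptions made here for exposition.

    # Illustrative two-step partition heuristic (assumed, simplified cost model).
    from collections import defaultdict

    def partition_data(samples, num_workers):
        """Step 1 (assumed rule): assign each sample (a set of feature ids) to the
        worker whose current feature set it overlaps most, breaking ties by the
        lightest load, so each worker pulls/pushes fewer distinct parameters."""
        worker_features = [set() for _ in range(num_workers)]
        worker_load = [0] * num_workers
        assignment = []
        for feats in samples:
            best = min(
                range(num_workers),
                key=lambda w: (-len(worker_features[w] & feats), worker_load[w]),
            )
            worker_features[best] |= feats
            worker_load[best] += 1
            assignment.append(best)
        return assignment, worker_features

    def partition_parameters(worker_features, num_servers):
        """Step 2 (assumed rule): place each parameter on the lightest-loaded
        server, weighting a parameter by how many workers touch it, as a proxy
        for the push/pull traffic it generates."""
        touch_count = defaultdict(int)
        for feats in worker_features:
            for f in feats:
                touch_count[f] += 1
        server_load = [0] * num_servers
        placement = {}
        for f, w in sorted(touch_count.items(), key=lambda kv: -kv[1]):
            s = min(range(num_servers), key=lambda i: server_load[i])
            placement[f] = s
            server_load[s] += w
        return placement

    if __name__ == "__main__":
        # Toy sparse dataset: each sample lists the feature ids it uses.
        samples = [{0, 1}, {1, 2}, {2, 3}, {0, 3}, {4, 5}, {5, 6}]
        data_assign, wfeats = partition_data(samples, num_workers=2)
        param_place = partition_parameters(wfeats, num_servers=2)
        print("sample -> worker:", data_assign)
        print("parameter -> server:", param_place)

The sketch only illustrates the worker/server split over the bipartite sample-parameter graph; the paper's heuristic and the multilevel approach optimize the actual training-time objective rather than these simple greedy proxies.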
