Communication-efficient ADMM-based distributed algorithms for sparse training

Guozheng Wang, Yongmei Lei, Yongwen Qiu, Lingfei Lou, Yixin Li

doi:10.1016/j.neucom.2023.126456

Abstract

In large-scale distributed machine learning (DML), the synchronization efficiency of the distributed algorithm becomes a critical factor in model training time as the computing scale increases. To address this challenge, we propose a novel algorithm called Grouped Sparse AllReduce based on the 2D-Torus topology (2D-TGSA), which keeps transmission traffic constant regardless of the number of workers. Our experimental results demonstrate that 2D-TGSA outperforms several benchmark algorithms in terms of synchronization efficiency. Moreover, we integrate the general form consistent ADMM with 2D-TGSA to develop a distributed algorithm (2D-TGSA-ADMM) that exhibits excellent scalability and can effectively handle large-scale distributed optimization problems. Furthermore, we enhance 2D-TGSA-ADMM by adopting the resilient adaptive penalty parameter approach, resulting in a new algorithm called 2D-TGSA-TPADMM. Our experiments on training the logistic regression model with ℓ1-norm regularization on the Tianhe-2 supercomputing platform demonstrate that our proposed algorithm can significantly reduce the synchronization time and training time compared to state-of-the-art methods.
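
As a rough illustration of the ADMM structure the abstract refers to, the sketch below runs plain global consensus ADMM for ℓ1-regularized logistic regression in a single process with NumPy. It is not the authors' 2D-TGSA-ADMM: the workers are simulated as a list of data shards, the averaging step merely stands in for whatever AllReduce a real system would use, and the penalty parameter rho is fixed rather than adaptive. The function names, the Newton-based local solver, and all default values are assumptions made for this example.

```python
import numpy as np


def soft_threshold(v, kappa):
    """Elementwise soft-thresholding: the proximal operator of kappa * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)


def local_x_update(A, b, x, z, u, rho, newton_steps=5):
    """Approximate one worker's x-update with a few Newton steps.

    Minimizes sum_i log(1 + exp(-b_i * a_i^T x)) + (rho / 2) * ||x - z + u||^2,
    where labels b_i are in {-1, +1}. The number of Newton steps is an
    arbitrary choice for this sketch.
    """
    for _ in range(newton_steps):
        margins = b * (A @ x)
        # p_i = sigmoid(-margin_i); clipping avoids overflow warnings in exp.
        p = 1.0 / (1.0 + np.exp(np.clip(margins, -50.0, 50.0)))
        grad = -A.T @ (b * p) + rho * (x - z + u)
        weights = p * (1.0 - p)
        hess = A.T @ (A * weights[:, None]) + rho * np.eye(x.size)
        x = x - np.linalg.solve(hess, grad)
    return x


def consensus_admm_l1_logreg(shards, lam=1.0, rho=1.0, iters=50):
    """Global consensus ADMM over simulated workers (hypothetical helper).

    shards: list of (A, b) pairs, one per simulated worker.
    lam:    weight of the l1 penalty on the consensus variable z.
    rho:    fixed ADMM penalty parameter (no adaptive scheme here).
    """
    num_workers = len(shards)
    dim = shards[0][0].shape[1]
    xs = [np.zeros(dim) for _ in shards]
    us = [np.zeros(dim) for _ in shards]
    z = np.zeros(dim)
    for _ in range(iters):
        # Local step: each worker updates its own copy of the model.
        xs = [
            local_x_update(A, b, x, z, u, rho)
            for (A, b), x, u in zip(shards, xs, us)
        ]
        # Synchronization step: this mean is the only cross-worker
        # communication, and is a stand-in for an AllReduce.
        avg = np.mean([x + u for x, u in zip(xs, us)], axis=0)
        # Consensus step: the l1 penalty yields a sparse z.
        z = soft_threshold(avg, lam / (rho * num_workers))
        # Dual step: accumulate each worker's disagreement with z.
        us = [u + x - z for x, u in zip(xs, us)]
    return z
```

Calling `consensus_admm_l1_logreg(shards, lam=0.5)` on a list of `(A, b)` pairs returns the consensus vector z. Only the averaging line combines data from more than one worker, so that is the step a communication-efficient synchronization scheme would need to replace.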

Journal: Neurocomputing
Publication Date: Jun 21, 2023
Citations: 2

Similar Papers

Coding for Large-Scale Distributed Machine Learning
Ming Xiao ... Mikael Skoglund
Entropy | VOL. 24
12 Sep 2022

Distributed Graph Computation Meets Machine Learning
Wencong Xiao ... Zhen Li
IEEE Transactions on Parallel and Distributed Systems | VOL. 31
20 Apr 2020

A Scalable, High-Performance, and Fault-Tolerant Network Architecture for Distributed Machine Learning
Songtao Wang ... Jinkun Geng
IEEE/ACM Transactions on Networking | VOL. 28
01 Aug 2020

Distributed Machine Learning with a Serverless Architecture
Hao Wang ... Di Niu
01 Apr 2019
