Abstract

The parameter server (PS) architecture, built on worker-server communication, is designed for distributed machine learning (ML) training in clusters. In feedback-driven exploration of ML model training, users exploit early feedback from each job to decide whether to kill it or keep it running, so as to find the optimal model configuration. However, PS does not support adjusting the number of workers and servers of a job at runtime. This becomes a bottleneck for scalable distributed ML training, because cluster resources cannot be dynamically allocated to or reclaimed from jobs, resulting in significant early feedback latency and resource under-utilization. This article rethinks the principles of the PS architecture. We present Elastic Parameter Server (EPS), a lightweight and user-transparent PS that accelerates feedback-driven exploration for distributed ML training. EPS can remove a subset of workers and servers from running jobs at runtime and allocate the released resources to an incoming job, reducing its early feedback latency. It can also use the resources released by a killed job to add workers and servers to running jobs, improving resource utilization and training speed. We develop a heuristic scheduler that leverages EPS and offers scalable resource scheduling for multiple ML jobs. We implement EPS in Tencent Angel and the scheduler in Apache Yarn, and conduct evaluations with various ML models. Experimental results show that EPS achieves up to a 1.5x improvement in ML training speed compared to PS.
