Abstract

The parameter server (PS) architecture, built on worker-server communication, is designed for distributed machine learning (ML) training in clusters. In feedback-driven exploration of ML model training, users exploit early feedback from each job to decide whether to kill it or keep it running, so as to find the optimal model configuration. However, PS does not support adjusting the number of workers and servers of a job at runtime. This becomes a bottleneck for scalable distributed ML training, because cluster resources cannot be dynamically allocated to or reclaimed from jobs, resulting in significant early feedback latency and resource under-utilization. This article rethinks the principles of the PS architecture. We present Elastic Parameter Server (EPS), a lightweight and user-transparent PS that accelerates feedback-driven exploration for distributed ML training. EPS can remove a subset of workers and servers from running jobs at runtime and allocate the released resources to an incoming job, reducing its early feedback latency. It can also use the resources released by a killed job to add workers and servers to running jobs, improving resource utilization and training speed. We develop a heuristic scheduler that leverages EPS and offers scalable resource scheduling for multiple ML jobs. We implement EPS in Tencent Angel and the scheduler in Apache Yarn, and conduct evaluations with various ML models. Experimental results show that EPS achieves up to a 1.5x improvement in ML training speed compared to PS.
