Abstract

The need to scale up machine learning, in the presence of rapid growth of data in both volume and variety, has sparked broad interest in developing distributed machine learning systems, typically based on parameter servers. However, since these systems run on a dedicated cluster of physical or virtual machines, they impose non-trivial cluster management overhead on machine learning practitioners and data scientists. In addition, there is an inherent mismatch between the dynamically varying resource demands of a model training job and the inflexible resource provisioning model of current cluster-based systems. In this paper, we propose SIREN, an asynchronous distributed machine learning framework based on the emerging serverless architecture, in which stateless functions can be executed in the cloud without the complexity of building and maintaining virtual machine infrastructures. With SIREN, we achieve a higher level of parallelism and elasticity by using a swarm of stateless functions, each working on a different batch of data, while greatly reducing system configuration overhead. Furthermore, we propose a scheduler based on deep reinforcement learning to dynamically control the number and memory size of the stateless functions used in each training epoch. The scheduler learns from the training process itself, in pursuit of the minimum possible training time given a cost budget. With our real-world prototype implementation on AWS Lambda, extensive experimental results show that SIREN reduces model training time by up to 44% compared with traditional machine learning training benchmarks on AWS EC2 at the same cost.
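To make the core mechanism concrete, the sketch below shows how one training epoch could be fanned out across a swarm of stateless functions on AWS Lambda, with the per-epoch worker count and memory size supplied by a scheduler decision. This is a minimal illustration, not SIREN's actual interface: the function name `siren-worker`, the payload fields, and the `launch_epoch` helper are all hypothetical; only the boto3 calls (`update_function_configuration`, `invoke`) are real AWS APIs.

```python
import json
import boto3

lambda_client = boto3.client("lambda")

def launch_epoch(epoch, num_workers, memory_mb):
    """Fan out one training epoch across num_workers stateless functions.

    (epoch-level scheduling decision: num_workers and memory_mb would come
    from the deep-reinforcement-learning scheduler described in the paper;
    here they are plain arguments for illustration.)
    """
    # Lambda memory is configured per function, not per invocation, so the
    # scheduler's memory decision is applied before the fan-out.
    lambda_client.update_function_configuration(
        FunctionName="siren-worker",  # hypothetical worker function name
        MemorySize=memory_mb,
    )

    # Invoke each worker asynchronously ("Event" = fire-and-forget), so the
    # workers run in parallel, each on its own shard of the training data.
    for worker_id in range(num_workers):
        lambda_client.invoke(
            FunctionName="siren-worker",
            InvocationType="Event",
            Payload=json.dumps({
                "epoch": epoch,
                "worker_id": worker_id,
                "num_workers": num_workers,  # lets a worker locate its shard
            }),
        )

# Example: epoch 5 with 32 workers at 1536 MB each (hypothetical values).
launch_epoch(epoch=5, num_workers=32, memory_mb=1536)
```

Because each invocation is stateless and works on a different batch, elasticity reduces to choosing these two numbers per epoch, which is exactly the action space the paper's scheduler learns over.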
