Cirrus

Joao Carreira, Alexey Tumanov, Andrew Zhang, Pedro Fonseca, Randy Katz

doi:10.1145/3357223.3362711

Abstract

Machine learning (ML) workflows are extremely complex. The typical workflow consists of distinct stages of user interaction, such as preprocessing, training, and tuning, that users execute repeatedly but that have heterogeneous computational requirements. This complexity makes it challenging for ML users to correctly provision and manage resources and, in practice, constitutes a significant burden that frequently causes over-provisioning and impairs user productivity. Serverless computing is a compelling model for addressing the resource management problem in general, but adopting it for existing ML frameworks poses numerous challenges due to significant restrictions on local resources. This work proposes Cirrus, an ML framework that automates the end-to-end management of datacenter resources for ML workflows by efficiently taking advantage of serverless infrastructures. Cirrus combines the simplicity of the serverless interface and the scalability of the serverless infrastructure (AWS Lambdas and S3) to minimize user effort. We show that a design specialized for both serverless computation and iterative ML training is needed for robust and efficient ML training on serverless infrastructure. Our evaluation shows that Cirrus outperforms frameworks specialized along a single dimension: Cirrus is 100x faster than a general-purpose serverless system [36] and 3.75x faster than specialized ML frameworks for traditional infrastructures [49].
