Abstract

Compared with traditional cloud computing, edge-cloud computing brings many benefits, such as low latency, low bandwidth cost, and high security. Thanks to these advantages, a large number of distributed machine learning (ML) jobs adopting the parameter server (PS) architecture are trained on edge-cloud networks to support smart applications. Scheduling such ML jobs must account for heterogeneous data transmission delays and frequent communication between workers and PSs, which raises a fundamental challenge: how to deploy workers and PSs on an edge-cloud network so as to minimize the average job completion time. To solve this problem, we propose an online scheduling framework that determines the location and execution time window for each job upon its arrival. Our approach includes: (i) an online scheduling framework that iteratively groups unprocessed ML jobs into multiple batches; (ii) a batch scheduling algorithm that maximizes the number of jobs scheduled in the current batch; (iii) two greedy algorithms that deploy workers and PSs to minimize the deployment cost. Large-scale, trace-driven simulations show that our algorithm outperforms the prevalent, state-of-the-art schedulers used in today's cloud systems.
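To make the greedy deployment idea in (iii) concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: it assumes hypothetical per-node deployment costs and pairwise transmission delays, places the PS at the node minimizing its own cost plus the average delay to the other nodes, then greedily assigns each worker to the remaining node with the lowest combined deployment cost and worker-to-PS delay.

```python
# Hypothetical greedy worker/PS placement sketch (assumed inputs, not the
# paper's exact cost model or algorithm).

def greedy_place(nodes, delays, num_workers, capacity):
    """nodes: node -> per-unit deployment cost.
    delays: (u, v) -> transmission delay between nodes (self-delay 0).
    num_workers: workers to place for this job.
    capacity: node -> available worker slots."""
    # Place the PS where its own cost plus average delay to other nodes is lowest.
    ps = min(
        nodes,
        key=lambda n: nodes[n]
        + sum(delays[(n, m)] for m in nodes if m != n) / (len(nodes) - 1),
    )
    placement = []
    cap = dict(capacity)
    for _ in range(num_workers):
        # Greedily pick the feasible node with the lowest combined cost
        # (deployment cost + delay to the PS) for this worker.
        candidates = [n for n in nodes if cap[n] > 0]
        best = min(candidates, key=lambda n: nodes[n] + delays[(n, ps)])
        placement.append(best)
        cap[best] -= 1
    return ps, placement
```

In this toy setting, cheap cloud nodes lose out to edge nodes once the worker-to-PS delay term dominates, which mirrors why edge placement can shorten job completion time.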
