Abstract

Compared with traditional cloud computing, edge-cloud computing brings many benefits, such as low latency, low bandwidth cost, and high security. Thanks to these advantages, a large number of distributed machine learning (ML) jobs adopting the parameter server (PS) architecture are trained on edge-cloud networks to support smart applications. Scheduling such ML jobs must account for heterogeneous data transmission delays and frequent communication between workers and PSs, which raises a fundamental challenge: how to deploy workers and PSs on an edge-cloud network so as to minimize the average job completion time. To solve this problem, we propose an online scheduling framework that determines the location and execution time window for each job upon its arrival. Our approach includes: (i) an online scheduling framework that iteratively groups unprocessed ML jobs into multiple batches; (ii) a batch scheduling algorithm that maximizes the number of jobs scheduled in the current batch; (iii) two greedy algorithms that deploy workers and PSs to minimize the deployment cost. Large-scale, trace-driven simulations show that our algorithm outperforms the prevalent, state-of-the-art schedulers used in today's cloud systems.
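To make the greedy deployment idea in (iii) concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: it assumes hypothetical per-node deployment costs and pairwise transmission delays, places the PS at the node minimizing its own cost plus the average delay to the other nodes, then greedily assigns each worker to the remaining node with the lowest combined deployment cost and worker-to-PS delay.

```python
# Hypothetical greedy worker/PS placement sketch (assumed inputs, not the
# paper's exact cost model or algorithm).

def greedy_place(nodes, delays, num_workers, capacity):
    """nodes: node -> per-unit deployment cost.
    delays: (u, v) -> transmission delay between nodes (self-delay 0).
    num_workers: workers to place for this job.
    capacity: node -> available worker slots."""
    # Place the PS where its own cost plus average delay to other nodes is lowest.
    ps = min(
        nodes,
        key=lambda n: nodes[n]
        + sum(delays[(n, m)] for m in nodes if m != n) / (len(nodes) - 1),
    )
    placement = []
    cap = dict(capacity)
    for _ in range(num_workers):
        # Greedily pick the feasible node with the lowest combined cost
        # (deployment cost + delay to the PS) for this worker.
        candidates = [n for n in nodes if cap[n] > 0]
        best = min(candidates, key=lambda n: nodes[n] + delays[(n, ps)])
        placement.append(best)
        cap[best] -= 1
    return ps, placement
```

In this toy setting, cheap cloud nodes lose out to edge nodes once the worker-to-PS delay term dominates, which mirrors why edge placement can shorten job completion time.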
