Job scheduling for distributed machine learning in optical WAN

Ling Liu, Hongfang Yu, Gang Sun, Long Luo, Qixuan Jin, Shouxi Luo

doi:10.1016/j.future.2020.06.007

Abstract

Large companies operate tens of data centers (DCs) across the globe to serve their customers and store data. At the same time, many machine learning applications need a global view of this geographically dispersed data to achieve high model accuracy. For such geo-distributed machine learning (Geo-DML), however, it is infeasible to move all data to one location over wide-area networks (WANs) because of scarce WAN bandwidth, privacy concerns and data sovereignty laws. Most Geo-DML systems therefore train models in a geo-distributed fashion, which requires global model synchronization between DCs over the WAN. As training data and model sizes grow rapidly, it becomes challenging to use scarce and heterogeneous WAN bandwidth efficiently for model synchronization. Advances in optical technology make the network topology reconfigurable in an optical WAN, which brings a new opportunity for Geo-DML training over WAN. We propose to optimize Geo-DML training with centralized joint control of the network layer and the reconfigurable optical layer. We prove that the intra-job scheduling problem is NP-hard and that the inter-job scheduling problem is strongly NP-hard. For intra-job scheduling, we present RoWAN, which is based on a deterministic rounding algorithm; it dynamically changes the topology by reconfiguring the optical devices and allocates a path and a rate to each flow. For inter-job scheduling, we provide delayed SWRT, which schedules multiple jobs according to their priorities. Simulations on real topologies show that RoWAN reduces the global model synchronization communication time of a single iteration by 15.54%-48.2% on average in comparison with traditional solutions. Compared with three other inter-job scheduling approaches, delayed SWRT reduces the weighted job completion time (WJCT) by about 60%, 44.8% and 28.76%, respectively.
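To make the WJCT metric concrete, the following is a minimal sketch that assumes the usual definition: the sum, over all jobs, of each job's weight multiplied by its completion time. It assumes jobs run one after another on a single shared resource, which is a simplification for illustration only. The ordering shown is the classic weighted-shortest-processing-time rule (Smith's rule), not the paper's delayed SWRT, whose details the abstract does not give. The `Job` class, the function names and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    weight: float    # priority weight (hypothetical value)
    duration: float  # time the job occupies the shared resource, in seconds


def weighted_completion_time(order):
    """Return the sum of weight * completion time for jobs run back to back."""
    clock = 0.0
    total = 0.0
    for job in order:
        clock += job.duration        # this job finishes at `clock`
        total += job.weight * clock
    return total


def smith_order(jobs):
    """Order jobs by duration / weight, ascending (Smith's rule).

    This is a stand-in used for illustration; it is not delayed SWRT.
    """
    return sorted(jobs, key=lambda j: j.duration / j.weight)


# Hypothetical workload: three jobs with different priorities and lengths.
jobs = [Job("A", 1.0, 6.0), Job("B", 3.0, 3.0), Job("C", 2.0, 4.0)]

fifo_wjct = weighted_completion_time(jobs)                # arrival order A, B, C
smith_wjct = weighted_completion_time(smith_order(jobs))  # order B, C, A

print(f"FIFO WJCT:  {fifo_wjct}")   # 59.0
print(f"Smith WJCT: {smith_wjct}")  # 36.0
```

In this toy workload, running the high-weight short job first lowers WJCT from 59 to 36, which illustrates why priority-aware ordering matters for the metric reported above.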

Journal: Future Generation Computer Systems
Publication Date: Jun 9, 2020
Citations: 10

