Abstract
Large companies operate tens of data centers (DCs) across the globe to serve their customers and store data. Meanwhile, many machine learning applications need a global view of this geo-distributed data to achieve high model accuracy. However, for such geo-distributed machine learning (Geo-DML), it is infeasible to move all data to one location over wide-area networks (WANs) due to scarce WAN bandwidth, privacy concerns and data sovereignty laws. Therefore, most Geo-DML systems train models in a geo-distributed manner, which requires global model synchronization between DCs over the WAN. With the rapid growth of training data and model sizes, it is challenging to efficiently utilize scarce and heterogeneous WAN bandwidth to synchronize models. With the advancement of optical technology, the network topology of an optical WAN becomes reconfigurable, which brings a new opportunity for Geo-DML training over WAN. We propose to optimize Geo-DML training with centralized joint control of the network and reconfigurable optical layers. We prove that the intra-job and inter-job scheduling problems are NP-hard and strongly NP-hard, respectively. For intra-job scheduling, we present RoWAN, which is based on a deterministic rounding algorithm, to dynamically change the topology by reconfiguring the optical devices and to allocate a path and rate for each flow. For inter-job scheduling, we provide delayed SWRT to schedule multiple jobs according to their priorities. Simulations on real topologies show that RoWAN reduces the per-iteration communication time of global model synchronization by up to 15.54%-48.2% on average compared with traditional solutions. Compared with three other inter-job scheduling approaches, delayed SWRT reduces the weighted job completion time (WJCT) by about 60%, 44.8% and 28.76%, respectively.