Online Placement and Scaling of Geo-Distributed Machine Learning Jobs via Volume-Discounting Brokerage

Xiaotong Li,Ruiting Zhou,Zongpeng Li,Chuan Wu,Lei Jiao,Yuhang Deng

doi:10.1109/tpds.2019.2955935

Abstract

Geo-distributed machine learning (ML) often uses large geo-dispersed data collections produced over time to train global models, without consolidating the data to a central site. In the parameter server architecture, “workers” and “parameter servers” for a geo-distributed ML job should be strategically deployed and adjusted on the fly, to allow easy access to the datasets and fast exchange of the model parameters at any time. Despite many cloud platforms now provide volume discounts to encourage the usage of their ML resources, different geo-distributed ML jobs that run in the clouds often rent cloud resources separately and respectively, thus rarely enjoying the benefit of discounts. We study an ML broker service that aggregates geo-distributed ML jobs into cloud data centers for volume discounts via dynamic online placement and scaling of workers and parameter servers in individual jobs for long-term cost minimization. To decide the number and the placement of workers and parameter servers, we propose an efficient online algorithm which first decomposes the online problem into a series of one-shot optimization problems solvable at each individual time slot by the technique of regularization, and afterwards round the fractional decisions to the integer ones via a carefully-designed dependent rounding method. We prove a parameterized-constant competitive ratio for our online algorithm as the theoretical performance analysis, and also conduct extensive simulation studies to exhibit its close-to-offline-optimum practical performance in realistic settings.

Full Text