Abstract

To reduce costs, companies usually train machine learning (ML) models on a shared multi-tenant system. In this cooperative environment, a fundamental challenge is how to distribute resources fairly among tenants so that every tenant is satisfied. A satisfactory allocation policy needs to meet the following properties. First, each tenant's performance in the shared cluster should be at least as good as in its exclusive cluster partition. Second, no tenant can gain by misreporting its demands. Third, tenants cannot use the idle resources of others for free. Moreover, resource allocation for ML workloads should avoid costly migration overhead. To this end, we propose Astraea, a three-layer scheduling framework: i) a batch scheduling framework that groups unprocessed jobs into multiple batches; ii) a round-by-round algorithm that lets tenants reserve their share of resources and schedules jobs non-preemptively; iii) a one-round algorithm, based on a primal-dual approach and a posted-pricing framework, that encourages tenants to report their demands truthfully. Astraea is proven to achieve a performance guarantee and several desirable sharing properties, including sharing incentive, strategy-proofness, and gain-as-you-contribute fairness. Extensive trace-driven simulations show that Astraea outperforms three state-of-the-art baselines in both fairness and cluster efficiency.
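To make the three-layer structure concrete, the following is a minimal, hypothetical Python sketch of how such a pipeline could be organized. All names (`Job`, `Tenant`, `group_into_batches`, `run_round`, `posted_price`) and the simple utilization-based price rule are illustrative assumptions; they are not Astraea's actual primal-dual pricing or scheduling logic.

```python
# Illustrative sketch only: the class/function names and the pricing rule below
# are hypothetical and are not taken from the Astraea paper.
from dataclasses import dataclass


@dataclass
class Job:
    tenant: str
    demand: float          # reported resource demand (e.g., number of GPUs)
    reported_value: float  # tenant's reported valuation for running the job


@dataclass
class Tenant:
    name: str
    contribution: float    # capacity this tenant contributes to the shared cluster


def group_into_batches(jobs, batch_size):
    """Layer i (sketch): group unprocessed jobs into batches."""
    return [jobs[i:i + batch_size] for i in range(0, len(jobs), batch_size)]


def posted_price(used, capacity):
    """Layer iii (sketch): a posted price that rises with utilization, so
    overstating demand becomes costly. The real rule is derived via primal-dual."""
    utilization = used / capacity if capacity else 1.0
    return 1.0 + 4.0 * utilization


def run_round(batch, tenants, capacity):
    """Layer ii (sketch): one non-preemptive round. A tenant can first use up to
    its contributed share; beyond that, a job is admitted only if its reported
    value covers the posted price of the resources it requests."""
    used = 0.0
    share_used = {name: 0.0 for name in tenants}
    scheduled = []
    for job in sorted(batch, key=lambda j: j.reported_value, reverse=True):
        t = tenants[job.tenant]
        price = posted_price(used, capacity) * job.demand
        within_share = share_used[t.name] + job.demand <= t.contribution
        if used + job.demand <= capacity and (within_share or job.reported_value >= price):
            share_used[t.name] += job.demand
            used += job.demand
            scheduled.append(job)
    return scheduled


if __name__ == "__main__":
    tenants = {"A": Tenant("A", contribution=4.0), "B": Tenant("B", contribution=4.0)}
    jobs = [Job("A", 2.0, 3.0), Job("B", 2.0, 5.0), Job("A", 3.0, 1.0), Job("B", 1.0, 2.0)]
    capacity = sum(t.contribution for t in tenants.values())
    for batch in group_into_batches(jobs, batch_size=2):
        print([(j.tenant, j.demand) for j in run_round(batch, tenants, capacity)])
```

In this toy version, the share check gives each tenant the performance of its own partition (sharing incentive), while the rising posted price discourages inflated demands; the paper's actual algorithms establish these properties formally.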
