Abstract

Distributed deep learning frameworks facilitate large deep learning workloads. These frameworks support sharing one GPU device among multiple jobs to improve resource utilization. Modern deep learning training jobs consume a large amount of GPU memory. Even so, sharing GPU memory among jobs is still possible because a training job runs in iterative steps, so its memory usage fluctuates over time. However, resource sharing also introduces the risk of job performance degradation: co-located jobs sharing a GPU device may suffer from different levels of interference, mainly caused by memory oversharing. Improving resource utilization while maintaining good job performance is a novel challenge for job placement strategies. This paper studies the job placement problem. We propose an opportunistic memory sharing model to describe the time-varying job memory requirements. Based on this model, we introduce an Opportunistic Job Placement Problem (OJPP) for shared GPU clusters, which seeks job placement configurations that use a minimum number of GPU devices while guaranteeing user-defined performance requirements. To solve the problem, we propose a greedy algorithm and a heuristic algorithm with computational complexities of O(n log n) and O(n^2 log n), respectively. We also propose an online adjustment algorithm with a computational complexity of O(n log n) to update job placement configurations at runtime. A machine-learning-based interference prediction method is used to produce accurate interference estimates. Extensive experiments are conducted on a GPU cluster to verify the correctness and effectiveness of our algorithms. Compared with standalone training jobs on dedicated clusters, the proposed approach reduces resource consumption by 46% in a shared cluster while guaranteeing over 92.97% of the job performance, measured by average job completion time.
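
To make the placement idea concrete, the following is a minimal sketch of memory-aware job placement under oversharing. It is not the greedy or heuristic algorithm of the paper, whose details the abstract does not give; it is a standard first-fit-decreasing heuristic. The names `Job`, `Gpu`, `place_jobs`, the fields `base_mem` and `peak_mem`, and the `overshare` tolerance are all assumptions introduced for illustration. The assumed sharing rule is that the memory jobs hold permanently must fit on the device, while the sum of their peaks may exceed capacity by a bounded fraction because peaks rarely coincide.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    base_mem: float  # memory held throughout an iteration, in GiB (hypothetical field)
    peak_mem: float  # highest memory reached within an iteration, in GiB (hypothetical field)


@dataclass
class Gpu:
    capacity: float  # device memory in GiB
    jobs: list = field(default_factory=list)

    def can_host(self, job: Job, overshare: float) -> bool:
        # Assumed sharing rule, not taken from the paper:
        # persistent memory must always fit, and summed peaks may
        # exceed capacity by at most the `overshare` fraction.
        base = sum(j.base_mem for j in self.jobs) + job.base_mem
        peak = sum(j.peak_mem for j in self.jobs) + job.peak_mem
        return base <= self.capacity and peak <= self.capacity * (1.0 + overshare)


def place_jobs(jobs, capacity, overshare=0.25):
    """First-fit decreasing: largest-peak jobs first, open a GPU only when needed."""
    gpus = []
    for job in sorted(jobs, key=lambda j: j.peak_mem, reverse=True):
        if job.peak_mem > capacity:
            raise ValueError(f"job {job.name} does not fit on a single GPU")
        # Take the first already-open GPU that can host the job, if any.
        target = next((g for g in gpus if g.can_host(job, overshare)), None)
        if target is None:
            target = Gpu(capacity)
            gpus.append(target)
        target.jobs.append(job)
    return gpus
```

Setting `overshare=0.0` reduces the rule to conventional peak-based packing, so comparing the number of GPUs returned for the two settings gives a rough sense of how much opportunistic sharing can save. A performance guarantee like the one described in the abstract would additionally require an interference estimate per co-location, which this sketch omits.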
