Abstract

With the rapid proliferation of deep learning (DL) jobs running on heterogeneous GPUs, scheduling DL jobs to satisfy diverse requirements, such as meeting deadlines and reducing job completion time (JCT), is critical. Unfortunately, existing efficiency-oriented and deadline-aware efforts remain rudimentary: they cannot schedule jobs to meet deadline requirements while also reducing total JCT, especially when jobs have widely varying execution times across heterogeneous GPUs. We therefore present Hydra, a novel quantitative cost-comparison approach that addresses this scheduling problem. Here, the cost is the total JCT plus a dynamic penalty computed from the total tardiness (i.e., the time by which jobs exceed their deadlines) of all jobs. Hydra adopts a sampling approach that exploits the inherent iterative periodicity of DL jobs to accurately estimate job execution times on heterogeneous GPUs. Hydra then explores combinations of job sequences and GPU placements with an efficient branch-and-bound algorithm to find the schedule that minimizes this cost. Evaluation on Alibaba traces shows that Hydra reduces total tardiness by 85.8% compared with state-of-the-art efforts while minimizing total JCT.
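The following is a minimal sketch, not Hydra's actual implementation, of the two ideas the abstract names: a sampling-based execution-time estimate that extrapolates from a few profiled iterations, and a branch-and-bound search over job orderings and GPU placements that minimizes total JCT plus a tardiness penalty. All names and parameters here (estimate_exec_time, PENALTY_WEIGHT, the dictionary layouts) are illustrative assumptions, not the paper's API.

```python
PENALTY_WEIGHT = 2.0  # assumed weight converting total tardiness into a cost penalty


def estimate_exec_time(sampled_iter_times, total_iters):
    """Sampling-based estimate: profile a few training iterations of a job on a
    given GPU, then extrapolate via the job's iterative periodicity."""
    return sum(sampled_iter_times) / len(sampled_iter_times) * total_iters


def branch_and_bound(jobs, gpus, exec_time, deadline):
    """Minimize total JCT + PENALTY_WEIGHT * total tardiness.

    jobs      : list of job ids
    gpus      : list of GPU ids
    exec_time : dict (job, gpu) -> estimated execution time
    deadline  : dict job -> deadline (measured from time 0, like JCT)
    """
    best_cost, best_plan = float("inf"), None

    def expand(pending, order, placement, finish, jct, tardiness):
        nonlocal best_cost, best_plan
        cost = jct + PENALTY_WEIGHT * tardiness
        if cost >= best_cost:          # bound: a partial schedule's cost never decreases
            return
        if not pending:
            best_cost, best_plan = cost, (order, dict(placement))
            return
        for job in pending:            # branch on the next job to dispatch...
            for gpu in gpus:           # ...and on its GPU placement
                end = finish.get(gpu, 0.0) + exec_time[(job, gpu)]
                placement[job] = gpu
                new_finish = dict(finish)
                new_finish[gpu] = end  # GPU is busy until this job finishes
                expand(pending - {job}, order + (job,), placement, new_finish,
                       jct + end,
                       tardiness + max(0.0, end - deadline[job]))
                del placement[job]

    expand(frozenset(jobs), (), {}, {}, 0.0, 0.0)
    return best_cost, best_plan
```

Because both total JCT and total tardiness only grow as jobs are appended to a partial schedule, the partial cost is a valid lower bound, so any branch whose partial cost already exceeds the best complete schedule can be pruned, which is what makes branch-and-bound practical here compared with naive enumeration of all sequence-placement combinations.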
