Colocating multiple jobs on the same server is widely used to improve resource utilization in cloud datacenters. However, colocated jobs contend for shared resources, which can lead to significant performance degradation. An efficient approach to eliminating performance interference is to partition the shared resources among the colocated jobs, but this makes resource management in datacenters very challenging. In this paper, we propose JointOPT, the first resource management framework that jointly optimizes job assignment and resource partitioning to improve the throughput of cloud datacenters. JointOPT uses a local-search-based algorithm to find a near-optimal job assignment configuration and a deep reinforcement learning (DRL) based approach to dynamically partition the shared resources among the colocated jobs. To reduce the overhead of interacting with the real system, it leverages deep learning to estimate job performance without running the jobs on real servers. We conduct extensive experiments to evaluate JointOPT, and the results show that it significantly outperforms the state-of-the-art baselines, with an advantage ranging from 13.3% to 47.7%.