Workload consolidation is a widely used approach to enhance resource utilization in modern data centers. However, the concurrent execution of multiple jobs on a shared server introduces contention for essential shared resources such as CPU cores, Last Level Cache, and memory bandwidth. This contention negatively impacts job performance, leading to significant degradation in throughput. To mitigate resource contention, effective resource isolation techniques at the software or hardware level can be employed to partition the shared resources among colocated jobs. However, existing solutions for resource partitioning often assume a limited number of jobs that can be colocated, making them unsuitable for scenarios with a large-scale job colocation due to several critical challenges. In this study, we propose Lavender, a framework specifically designed for addressing large-scale resource partitioning problems. Lavender incorporates several key techniques to tackle the challenges associated with large-scale resource partitioning, ensuring efficiency, adaptivity, and optimality. We conducted comprehensive evaluations of Lavender to validate its performance and analyze the reasons for its advantages. The experimental results demonstrate that Lavender significantly outperforms state-of-the-art baselines. Lavender is publicly available at https://github.com/yanxiaoqi932/OpenSourceLavender.
Read full abstract