High Throughput VMs Placement With Constrained Communication Overhead and Provable Guarantees

Yaniv Sa’Ar, Gil Einziger, Gabriel Scalosub, Itamar Cohen, Maayan Goldstein, Erez Waisbard

doi:10.1109/tnsm.2023.3238644

Abstract

Placement of VMs in the cloud is one of the most fundamental problems in systems research. Traditionally, placement algorithms assume that the schedulers have complete information about the currently available resources at each host. However, this assumption is in many cases unrealistic, as gathering fresh status information from each of the thousands of hosts in a large data center incurs excessive communication overhead, which results in long queueing delays. Efforts to resolve this problem by employing several parallel schedulers typically exhibit collisions when several schedulers simultaneously try to place VMs on the same host. Our work analyzes the performance of various placement algorithms and provides empirical evidence that using multiple randomized schedulers obtains high throughput while significantly decreasing both the communication overhead and the number of collisions between schedulers. We therefore introduce Adaptive Partial State Random (APSR), an efficient parallel random resource management algorithm that samples only a small number of hosts and dynamically adjusts the degree of parallelism to provide provable guarantees on the probability of collisions between distinct schedulers. We formally analyze APSR, evaluate it on real workloads, and integrate it into the popular OpenStack cloud management platform. Our evaluation shows that APSR matches the throughput provided by other parallel schedulers while achieving an up to 13x lower decline ratio and a reduction of over 85% in communication overhead.

Full Text