Abstract

As the Internet of Things (IoT) keeps growing, IoT-side intelligence services, such as intelligent personal assistants, healthcare monitoring, and smart home services, offload increasingly complex machine learning (ML) inference workloads to cloud clusters. GPUs have been widely adopted to accelerate the execution of these ML inference workloads. However, current cluster management systems guarantee low tail latency for ML inference by over-provisioning resources and using small batch sizes, which wastes GPU resources and greatly increases service costs. To mitigate poor GPU utilization, we present AutoInfer, a self-driving cluster management system for ML inference serving in GPU clusters, where users express only the latency and accuracy requirements of their workloads without needing to specify the model variant, GPU provisioning strategy, or batching mechanism. AutoInfer extends matrix factorization to automatically recommend a model variant for each newly arriving ML inference workload that meets its latency and accuracy requirements, by identifying similarities to previously scheduled workloads. At runtime, AutoInfer leverages online telemetry data and deep reinforcement learning to adaptively adjust GPU allocation and batch size in response to load variations while minimizing the impact on tail-latency Service Level Objectives (SLOs). Testbed experiments show that AutoInfer improves average GPU utilization by up to 77% and keeps tail-latency SLO violations under 5.5%.
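
The abstract does not give the exact formulation, but the recommendation step can be pictured with a small matrix-factorization sketch (Python/NumPy). Everything here is an illustrative assumption rather than the paper's method: the workload-by-variant latency matrix, the alternating-least-squares solver, the ridge-style recovery of a new workload's latent factor from a few profiling runs, and all names such as factorize and recommend.

    # Minimal sketch of matrix-factorization-based variant recommendation.
    # All names, dimensions, and the recovery procedure are illustrative
    # assumptions, not AutoInfer's actual formulation.
    import numpy as np

    rng = np.random.default_rng(0)

    def factorize(L, mask, rank=2, reg=0.1, iters=50):
        """Factor a partially observed workload x variant latency matrix L ~ W @ V.T.

        L:    (n_workloads, n_variants) latency matrix (values only valid where mask is True)
        mask: boolean matrix, True where a latency measurement exists
        """
        n, m = L.shape
        W = rng.standard_normal((n, rank))
        V = rng.standard_normal((m, rank))
        I = reg * np.eye(rank)
        for _ in range(iters):
            for i in range(n):                      # update workload factors
                obs = mask[i]
                Vi = V[obs]
                W[i] = np.linalg.solve(Vi.T @ Vi + I, Vi.T @ L[i, obs])
            for j in range(m):                      # update variant factors
                obs = mask[:, j]
                Wj = W[obs]
                V[j] = np.linalg.solve(Wj.T @ Wj + I, Wj.T @ L[obs, j])
        return W, V

    def recommend(profiled, V, accuracy, latency_slo, reg=0.1):
        """Recommend the most accurate variant whose predicted latency meets the SLO.

        profiled: dict {variant_index: measured_latency} from a few quick
                  profiling runs of the new workload
        V:        learned variant factors
        accuracy: per-variant accuracy (higher is better)
        """
        idx = np.array(list(profiled.keys()))
        y = np.array(list(profiled.values()))
        Vi = V[idx]
        # Ridge regression estimates the new workload's latent factor
        # from the handful of profiled variants.
        w = np.linalg.solve(Vi.T @ Vi + reg * np.eye(V.shape[1]), Vi.T @ y)
        pred_latency = V @ w                        # predicted latency on every variant
        feasible = np.where(pred_latency <= latency_slo)[0]
        if feasible.size == 0:
            return None                             # no variant meets the SLO
        return feasible[np.argmax(accuracy[feasible])]

    # Toy usage: 6 previously scheduled workloads, 4 model variants.
    true_latency = np.outer(rng.uniform(1, 3, 6), rng.uniform(5, 40, 4))   # ms
    mask = rng.random((6, 4)) < 0.7                 # only some pairs were ever observed
    W, V = factorize(true_latency * mask, mask)

    accuracy = np.array([0.70, 0.78, 0.85, 0.91])   # hypothetical per-variant accuracy
    new_profile = {0: 12.0, 2: 55.0}                # quick measurements for a new workload
    print("recommended variant:", recommend(new_profile, V, accuracy, latency_slo=60.0))

In this sketch, a variant is chosen by filtering the predicted latencies against the SLO and picking the highest-accuracy survivor, mirroring the interface described above in which users state only latency and accuracy requirements.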
