Abstract
As the Internet of Things (IoT) continues to grow, IoT-side intelligence services, such as intelligent personal assistants, healthcare surveillance, and smart home services, offload increasingly complex machine learning (ML) inference workloads to cloud clusters. GPUs have been widely adopted to accelerate the execution of these ML inference workloads. However, current cluster management systems guarantee low tail latency for ML inference through resource over-provisioning and small batch sizes, which seriously wastes GPU resources and greatly increases service costs. To mitigate this poor GPU utilization, we present AutoInfer, a self-driving cluster management system for ML inference serving in GPU clusters, in which users express only the latency and accuracy requirements of their workloads, without specifying the model variant, GPU provisioning strategy, or batching mechanism. AutoInfer extends the matrix factorization model to automatically recommend a model variant for each newly arriving ML inference workload with respect to its latency and accuracy requirements, by identifying similarities to previously scheduled workloads. At runtime, AutoInfer leverages online telemetry data and deep reinforcement learning to adaptively adjust GPU allocations and batch sizes under load variations while minimizing violations of tail-latency Service Level Objectives (SLOs). Testbed experiments show that AutoInfer improves average GPU utilization by up to 77% while keeping tail-latency SLO violations under 5.5%.
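To make the recommendation step concrete, the following is a minimal sketch (not AutoInfer's actual implementation) of matrix-factorization-based model-variant selection: a sparse workload-by-variant latency matrix is factorized into low-rank factors, the completed matrix predicts latencies for unprofiled pairs, and the variant with the highest accuracy that satisfies the latency SLO is recommended. All names, shapes, and numbers here are hypothetical, and the nearest-neighbor matching of a new workload is a simplification of the similarity mechanism described in the abstract.

```python
# Illustrative sketch only: matrix factorization over a sparse
# workload x model-variant latency matrix, then SLO-constrained
# variant selection. Hypothetical data and parameters throughout.
import numpy as np

rng = np.random.default_rng(0)

# Observed p99 latencies (ms) for 5 previously scheduled workloads on
# 4 model variants; NaN marks pairs that were never profiled.
L = np.array([
    [12.0,   21.0,   np.nan, 55.0],
    [11.0,   np.nan, 33.0,   50.0],
    [np.nan, 19.0,   30.0,   np.nan],
    [13.0,   22.0,   np.nan, 58.0],
    [10.0,   18.0,   28.0,   np.nan],
])
acc = np.array([0.71, 0.76, 0.80, 0.84])  # accuracy per model variant


def factorize(M, rank=2, steps=5000, lr=0.01, reg=0.02):
    """SGD matrix factorization fit on the observed entries of M."""
    n, m = M.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    obs = np.argwhere(~np.isnan(M))  # indices of profiled pairs
    for _ in range(steps):
        i, j = obs[rng.integers(len(obs))]
        err = M[i, j] - U[i] @ V[j]
        u_old = U[i].copy()
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * u_old - reg * V[j])
    return U, V


U, V = factorize(L)
pred = U @ V.T  # completed latency predictions for every pair

# A new workload with one profiled measurement (variant 0, 11.5 ms) is
# matched to its most similar past workload, whose completed row then
# predicts its latency on the remaining variants.
new_row = pred[np.argmin(np.abs(pred[:, 0] - 11.5))]

slo_ms = 35.0
feasible = new_row <= slo_ms  # a real system must handle "none feasible"
best = int(np.argmax(np.where(feasible, acc, -np.inf)))
print(f"recommended variant {best}: "
      f"predicted latency {new_row[best]:.1f} ms, accuracy {acc[best]:.2f}")
```

The low-rank assumption is what lets the recommender avoid exhaustively profiling every workload on every variant: a handful of measurements for a new workload suffice to place it relative to previously scheduled ones.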