Abstract

GPUs are essential to accelerating latency-sensitive deep neural network (DNN) inference workloads in cloud datacenters. To fully utilize GPU resources, <i>spatial sharing</i> of GPUs among co-located DNN inference workloads is becoming increasingly compelling. However, GPU sharing inevitably causes <i>severe performance interference</i> among co-located inference workloads, as shown by an empirical measurement study of DNN inference on EC2 GPU instances. While existing work on guaranteeing inference performance service level objectives (SLOs) focuses on either <i>temporal sharing</i> of GPUs or <i>reactive</i> GPU resource scaling and inference migration techniques, how to <i>proactively</i> mitigate such severe performance interference has received comparatively little attention. In this paper, we propose <i>iGniter</i>, an <i>interference-aware</i> GPU resource provisioning framework for cost-efficiently achieving predictable DNN inference in the cloud. <i>iGniter</i> comprises two key components: (1) a <i>lightweight</i> DNN inference performance model, which leverages practically accessible system and workload metrics to capture performance interference; and (2) a <i>cost-efficient</i> GPU resource provisioning strategy that <i>jointly</i> optimizes GPU resource allocation and adaptive batching based on our inference performance model, with the aim of achieving predictable performance for DNN inference workloads. We implement a prototype of <i>iGniter</i> based on the NVIDIA Triton inference server hosted on EC2 GPU instances. Extensive prototype experiments on four representative DNN models and datasets demonstrate that <i>iGniter</i> can guarantee the performance SLOs of DNN inference workloads with practically acceptable runtime overhead, while reducing monetary cost by up to <inline-formula><tex-math notation="LaTeX">$25\%$</tex-math></inline-formula> compared with state-of-the-art GPU resource provisioning strategies.

