Abstract

With the popularity of Deep Neural Network (DNN) models in diverse fields, DNN inference services have been widely deployed in the cloud to support intelligent applications on resource-limited devices. Serving DNN inference often requires GPU acceleration to meet latency-sensitive interactive targets. A common approach to improving GPU utilization is to let multiple models share a GPU. However, this approach can degrade both the responsiveness and the throughput of model serving systems, because concurrent DNN inference tasks contend for GPU resources. In addition, interference among heterogeneous DNN inference tasks can cause a performance isolation problem, in which different models suffer different degrees of serving performance degradation. Existing works fail to ensure performance isolation among users of heterogeneous DNN models. To solve this problem, we propose InferFair, a QoS-aware scheduling framework that ensures performance isolation in heterogeneous model serving systems. InferFair centers on two key designs: (1) periodically estimating the effective throughput requirements of all active models online, and (2) applying fine-grained adjustments to minimize the differing impact of GPU sharing on heterogeneous model services. We conduct extensive experiments on a variety of DNN models to demonstrate the effectiveness of InferFair. Compared to a prior competitor, Clockwork, InferFair alleviates the performance isolation problem by up to 1.7x while improving overall goodput by up to 25.6%.
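
To make the two key designs concrete, the following Python sketch illustrates one plausible form of such a control loop: periodic online demand estimation feeding proportional, fine-grained GPU-share adjustments. It is a minimal illustration under assumed interfaces, not InferFair's actual implementation; all names (estimate_demand, rebalance, gpu_share, recent_arrivals, ESTIMATION_PERIOD_S) are hypothetical.

    import time

    # Hypothetical sketch of a QoS-aware scheduling loop in the spirit of
    # the two designs described in the abstract. Names are illustrative.

    ESTIMATION_PERIOD_S = 0.1  # assumed re-estimation interval

    def estimate_demand(model):
        """Design (1): estimate a model's effective throughput requirement
        online from its recently observed request arrival timestamps."""
        window = model["recent_arrivals"]  # list of arrival timestamps
        if len(window) < 2:
            return 0.0
        span = window[-1] - window[0]
        return (len(window) - 1) / (span + 1e-9)  # requests per second

    def rebalance(models):
        """Design (2): apply fine-grained adjustments so each model's GPU
        share tracks its demand, evening out the slowdown that GPU sharing
        imposes on heterogeneous models."""
        demands = {m["name"]: estimate_demand(m) for m in models}
        total = sum(demands.values()) or 1.0  # avoid division by zero
        for m in models:
            m["gpu_share"] = demands[m["name"]] / total

    def scheduling_loop(models):
        # Periodically re-estimate demands and adjust shares.
        while True:
            rebalance(models)
            time.sleep(ESTIMATION_PERIOD_S)

In such a scheme, proportional shares equalize the relative impact of contention across models; the actual system may use a different estimator and a finer-grained adjustment mechanism than this sketch.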
