Abstract

GPUs used in data centers for deep learning inference are underutilized. Previous systems tended to deploy a single model per GPU to ensure that inference tasks met throughput and latency requirements. The rapid growth of a single GPU's resources, together with the emergence of workloads such as small models and small batches, has exacerbated the problem of low GPU utilization. In this setting, a mixed-model deployment approach can significantly improve GPU utilization while also giving the upper layers of the inference system greater flexibility. However, how to select model combinations and optimization strategies for mixed-model deployment remains an open problem. This paper proposes Optimum, the first model-combination planning and runtime optimization framework for mixed-model deployment. To cope with the enormous search space, Optimum selects model combinations via performance prediction with low search overhead. The predictor is a multilayer perceptron: its input features are profiling results of the model engines, and its output is the predicted performance degradation. Runtime optimization strategies allow Optimum to optimize performance and make fine-grained tradeoffs. The Optimum prototype is built on CUDA multi-stream and TensorRT. Test results show a consistent 10.3% performance improvement over mainstream single-model deployments, and up to a 7.09% improvement over the state of the art with an order-of-magnitude reduction in search overhead.
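The core of the predictor described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes a small MLP that maps per-engine profiling features of a candidate model combination to a single predicted performance-degradation score; the feature names and network size are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(n_in, n_hidden):
    """Randomly initialize a one-hidden-layer MLP (illustrative sizes)."""
    return (rng.normal(0, 0.1, (n_in, n_hidden)),
            np.zeros(n_hidden),
            rng.normal(0, 0.1, (n_hidden, 1)),
            np.zeros(1))

def mlp_forward(x, params):
    """Hidden layer with ReLU, then a linear output: the predicted degradation."""
    W1, b1, W2, b2 = params
    h = np.maximum(0.0, x @ W1 + b1)
    return h @ W2 + b2

# Hypothetical profiling features for a two-model combination, e.g.
# [SM occupancy A, SM occupancy B, mem-bandwidth A, mem-bandwidth B,
#  batch size A, batch size B] -- names are assumptions, not from the paper.
features = np.array([0.45, 0.30, 0.60, 0.25, 8.0, 4.0])
params = init_params(len(features), 16)
degradation = mlp_forward(features, params)  # predicted slowdown vs. running alone
```

In a planner like the one the abstract describes, such a predictor would be trained on measured co-location slowdowns and then queried for each candidate combination, so that only the cheap forward pass, not a full benchmark, is needed during the search.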
