Abstract

GPUs used in data centers for deep learning inference are underutilized. Previous systems tended to deploy a single model per GPU to ensure that inference tasks met throughput and latency requirements. The rapid growth of a single GPU's resources, together with the emergence of workloads such as small models and small batches, has exacerbated the problem of low GPU utilization. In this setting, a mixed-model deployment approach can significantly improve GPU utilization while also giving the upper layers of the inference system greater flexibility. However, how to select model combinations and optimization strategies for mixed-model deployment remains an open problem. This paper proposes Optimum, the first model-combination planning and runtime optimization framework for mixed-model deployment. To cope with the enormous search space, Optimum selects model combinations via performance prediction with low search overhead. The predictor is a multilayer perceptron: its input features are profiling results of the model engines, and its output is the predicted performance degradation. Runtime optimization strategies allow Optimum to optimize performance and make fine-grained tradeoffs. The Optimum prototype is built on CUDA multi-stream and TensorRT. Test results show a consistent 10.3% performance improvement over mainstream single-model deployments, and up to a 7.09% improvement over the state of the art with an order-of-magnitude reduction in search overhead.
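The core of the predictor described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes a small MLP that maps per-engine profiling features of a candidate model combination to a single predicted performance-degradation score; the feature names and network size are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(n_in, n_hidden):
    """Randomly initialize a one-hidden-layer MLP (illustrative sizes)."""
    return (rng.normal(0, 0.1, (n_in, n_hidden)),
            np.zeros(n_hidden),
            rng.normal(0, 0.1, (n_hidden, 1)),
            np.zeros(1))

def mlp_forward(x, params):
    """Hidden layer with ReLU, then a linear output: the predicted degradation."""
    W1, b1, W2, b2 = params
    h = np.maximum(0.0, x @ W1 + b1)
    return h @ W2 + b2

# Hypothetical profiling features for a two-model combination, e.g.
# [SM occupancy A, SM occupancy B, mem-bandwidth A, mem-bandwidth B,
#  batch size A, batch size B] -- names are assumptions, not from the paper.
features = np.array([0.45, 0.30, 0.60, 0.25, 8.0, 4.0])
params = init_params(len(features), 16)
degradation = mlp_forward(features, params)  # predicted slowdown vs. running alone
```

In a planner like the one the abstract describes, such a predictor would be trained on measured co-location slowdowns and then queried for each candidate combination, so that only the cheap forward pass, not a full benchmark, is needed during the search.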
