Abstract

Ensemble neural networks are widely applied in cloud-based inference services due to their remarkable performance, and the growing demand for low-latency services has led researchers to pay more attention to the execution efficiency of these models, especially device utilization. It is highly desirable to fully utilize GPUs by multiplexing different inference tasks on the same GPU with advanced sharing techniques, such as Multi-Process Service (MPS). However, we find that MPS struggles when applied to ensemble neural networks, which consist of multiple related sub-models. The critical challenge is to allocate resources efficiently within an ensemble so as to minimize job completion time. To tackle this challenge, we first examine the interplay among the individual neural networks within an ensemble and derive a guideline for achieving the shortest job completion time. We then establish a mathematical model that formalizes the resource requirements of each neural network, and introduce a search-based allocation algorithm designed to swiftly identify optimal solutions. Finally, we present ESEN, which combines the search-based resource allocation algorithm with efficient, user-friendly model execution mechanisms implemented in PyTorch. Experimental results demonstrate that ESEN attains an efficiency improvement of up to 17.84% and a GPU utilization increase of 28.09% over the default strategy. By optimizing GPU resource allocation, ESEN significantly improves the efficiency of ensemble models, providing a low-latency, high-accuracy solution for online interactive services.
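The abstract does not detail the search-based allocation algorithm, but its core idea can be illustrated with a minimal sketch: since an ensemble finishes only when its slowest sub-model finishes, the search looks for a split of the GPU's MPS percentage budget that minimizes that latest completion time. The `work` values and the inverse-proportional latency model below are illustrative assumptions, not the paper's actual cost model.

```python
from itertools import product

def completion_time(work, share):
    # Hypothetical latency model: a sub-model's completion time shrinks
    # in inverse proportion to its GPU share (an assumption for illustration).
    return work / share

def search_allocation(works, step=5):
    """Brute-force search over MPS percentage splits (multiples of `step`)
    that sum to 100, minimizing the ensemble's job completion time,
    i.e. the finishing time of the slowest sub-model."""
    n = len(works)
    best, best_t = None, float("inf")
    levels = range(step, 101, step)
    for alloc in product(levels, repeat=n):
        if sum(alloc) != 100:
            continue  # only full partitions of the GPU budget are valid
        t = max(completion_time(w, s) for w, s in zip(works, alloc))
        if t < best_t:
            best_t, best = t, alloc
    return best, best_t

# Example: three sub-models with uneven workloads; the search converges on
# a work-proportional split, equalizing all completion times.
best, best_t = search_allocation([30, 10, 10], step=10)
print(best, best_t)  # → (60, 20, 20) 0.5
```

A real system would replace the enumeration with the paper's pruned search and replace `completion_time` with profiled per-model latency curves, since actual latency under MPS is not perfectly inversely proportional to the assigned share.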
