Multi-Compression Scale DNN Inference Acceleration based on Cloud-Edge-End Collaboration

Huamei Qi,Fang Ren,Shaohua Wan,Xiaoheng Deng,Ping Jiang,Leilei Wang

doi:10.1145/3634704

Abstract

Edge intelligence has emerged as a promising paradigm to accelerate DNN inference by model partitioning, which is particularly useful for intelligent scenarios that demand high accuracy and low latency. However, the dynamic nature of the edge environment and the diversity of end devices pose a significant challenge for DNN model partitioning strategies. Meanwhile, limited resources of the edge server make it difficult to manage resource allocation efficiently among multiple devices. In addition, most of the existing studies disregard the different service requirements of the DNN inference tasks, such as its high accuracy-sensitive or high latency-sensitive. To address these challenges, we propose a Multi-Compression Scale DNN Inference Acceleration (MCIA) based on cloud-edge-end collaboration. We model this problem as a mixed-integer multi-dimensional optimization problem, jointly optimizing the DNN model version choice, the partitioning choice, and the allocation of computational and bandwidth resources to maximize the tradeoff between inference accuracy and latency depending on the property of the tasks. Initially, we train multiple versions of DNN inference models with different compression scales in the cloud, and deploy them to end devices and edge server. Next, a deep reinforcement learning-based algorithm is developed for joint decision making of adaptive collaborative inference and resource allocation based on the current multi-compression scale models and the task property. Experimental results show that MCIA can adapt to heterogeneous devices and dynamic networks, and has superior performance compared with other methods.

Full Text