Abstract

The deployment of inference services at the network edge, called edge inference, offloads computation-intensive inference tasks from mobile devices to edge servers, thereby enhancing the former's capabilities and battery life. In a multiuser system, the joint allocation of communication-and-computation (C2) resources (i.e., scheduling and bandwidth allocation) is made challenging by the adoption of efficient inference techniques, batching and early exiting, and is further complicated by the heterogeneity of users' requirements on accuracy and latency. Batching groups multiple tasks into a single batch for parallel processing to reduce time-consuming memory access, thereby boosting the throughput (i.e., completed tasks per second). On the other hand, early exiting allows a task to exit a deep neural network without traversing the whole network, thereby supporting a tradeoff between accuracy and latency. In this work, we study optimal C2 resource allocation with batching and early exiting, which is an NP-complete integer programming problem. To tackle this challenge, a set of efficient algorithms is designed under the criterion of maximum throughput. First, consider the case with batching but without early exiting. The target problem is solved optimally using a proposed best-shelf-packing algorithm that nests a threshold-based scheme, which selects the users with the best channels that also satisfy the computation-time constraints, within a sequential search for the maximum batch size. Next, consider the general case with both batching and early exiting. A low-complexity sub-optimal algorithm for C2 resource allocation is developed by modifying the preceding algorithm to exploit early exiting for latency reduction. The optimal approach, in contrast, nests a depth-first tree search with intelligent online pruning into a sequential search for the maximum batch size.
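The nested structure described above, a threshold-based user selection inside a sequential search over batch sizes, can be illustrated with a minimal sketch. The latency model, the equal bandwidth split, and all numerical values below are illustrative assumptions, not the paper's actual formulation.

```python
def batch_proc_time(m, t0=0.05, t_per_task=0.01):
    # Assumed batch processing-time model: a fixed overhead plus a small
    # per-task increment (batching amortizes memory access across tasks).
    return t0 + t_per_task * m

def max_feasible_batch(users, total_bandwidth):
    """users: list of (channel_rate_per_hz, data_bits, deadline_s).
    Returns the largest batch size for which the threshold rule is feasible."""
    best = []
    for m in range(1, len(users) + 1):     # sequential search over batch size
        # Threshold rule: try the m users with the best channels.
        cand = sorted(users, key=lambda u: -u[0])[:m]
        bw = total_bandwidth / m           # equal bandwidth split (assumption)
        t_comp = batch_proc_time(m)
        if all(bits / (rate * bw) + t_comp <= d for rate, bits, d in cand):
            best = cand                    # feasible: try a larger batch
        else:
            break                          # larger batches only get harder
    return len(best), best
```

Under the equal-split assumption the early `break` is safe: growing the batch shrinks each user's bandwidth and lengthens the compute time, so infeasibility is monotone in the batch size.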
The key idea is to derive pruning criteria based on the simple greedy solution for the target problem without a bandwidth constraint and apply the result to designing an intelligent online pruning scheme. Experimental results demonstrate that both optimal and sub-optimal C2 resource allocation algorithms can leverage integrated batching and early exiting to double the inference throughput compared with conventional schemes.
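The branch-and-bound idea behind the tree search can be sketched as follows. The latency model, the greedy relaxation used as the pruning bound, and the omission of accuracy constraints are simplifying assumptions for illustration, not the paper's algorithm.

```python
def dfs_max_scheduled(upload_times, exit_times, budget):
    """upload_times: per-user upload time; exit_times: per-exit compute times
    (accuracy constraints omitted in this sketch). Maximize the number of
    users scheduled within a shared time budget."""
    best = [0]

    def bound(i, used, count):
        # Optimistic bound: assume every remaining user takes the cheapest
        # exit, admitted cheapest-first -- a greedy relaxation of the problem
        # without the bandwidth constraint.
        cheapest = min(exit_times)
        spent = used
        for c in sorted(u + cheapest for u in upload_times[i:]):
            if spent + c <= budget:
                spent += c
                count += 1
        return count

    def dfs(i, used, count):
        if bound(i, used, count) <= best[0]:
            return                          # prune: cannot beat the incumbent
        if i == len(upload_times):
            best[0] = count
            return
        for t_exit in exit_times:           # branch: pick an exit point
            if used + upload_times[i] + t_exit <= budget:
                dfs(i + 1, used + upload_times[i] + t_exit, count + 1)
        dfs(i + 1, used, count)             # branch: skip this user

    dfs(0, 0.0, 0)
    return best[0]
```

The cheapest-first relaxation is a valid upper bound on the number of schedulable users, so pruning never discards an optimal branch.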
