Edge inference and other compute-intensive industrial Internet of Things (IIoT) applications suffer from a bad quality of experience due to the limited and heterogeneous computing and communication resources of embedded devices. To tackle these issues, we propose a model partitioning-based self-aware collaborative edge inference framework. Specifically, the device can adaptively adjust the local model inference scheme by sensing the available computing and communication resources of surrounding devices. When the inference latency requirement cannot be met by local computation, the model should be partitioned for collaborative computation on other devices to improve the inference efficiency. Furthermore, for two typical IIoT scenarios, i.e., bursting and stacking tasks, the latency-aware and throughput-aware collaborative inference algorithms are designed, respectively. Via jointly optimizing the partition layer and collaborative device selection, the optimal inference efficiency, characterized by minimum inference latency and maximum inference throughput, can be obtained. Finally, the performance of our proposal is validated through extensive simulations and tests conducted on 10 Raspberry Pi 4Bs using popular models. Specifically, in the case of two collaborative devices, our platform reaches up to 92.59% latency reduction for bursting tasks and 16.19× throughput growth for stacking tasks. In addition, the divergence between simulations and tests ranges from 1.64% to 9.56% for bursting tasks and from 3.24% to 11.24% for stacking tasks, which indicates that the theoretical performance analyses are solid. For the general case where the data privacy is not considered and the number of collaborative devices is optimally determined, up to 14.76× throughput speed up and 84.04% latency reduction can be obtained.
Read full abstract