Sustaining real-time, high-fidelity AI-based vision perception on edge devices is challenging due to both the high computational overhead of increasingly “deeper” Deep Neural Networks (DNNs) and the increasing resolution and quality of camera sensors. Such high-throughput vision perception is even more challenging in multi-tenancy systems, where video streams from multiple high-quality cameras must share the same GPU resource on a single edge device. Criticality-aware canvas-based processing is a promising paradigm that decomposes multiple concurrent video streams into Regions of Interest (RoIs) and spatially channels the limited computational resources to selected RoIs with higher “resolution”, thereby moderating the trade-off between computational load, task fidelity, and processing throughput. RA-MOSAIC (Resource Adaptive MOSAIC) employs such canvas-based processing, while further tuning the incoming video streams and available resources on demand so that the system can adapt to dynamic changes in workload (often arising from variations in the number or size of relevant objects observed by individual cameras). RA-MOSAIC utilizes two distinct and synergistic concepts. First, at the camera sensor, a bandwidth-adaptive and lightweight Bandwidth Aware Camera Transmission (BACT) method applies differential down-sampling to create mixed-resolution frames that preferentially preserve resolution for critical RoIs before the frames are transmitted to the edge node. Second, at the edge, BACT video streams received from multiple cameras are decomposed into multi-scale RoI tiles and spatially packed into a single ‘canvas frame’ using a novel workload-adaptive bin-packing strategy. Notably, the canvas frame itself is dynamically sized, so that the edge device can opportunistically provide higher processing throughput for selected high-priority tiles during periods of lower aggregate workload.
To demonstrate RA-MOSAIC’s gains in processing throughput and perception fidelity, we evaluate RA-MOSAIC on a single NVIDIA Jetson TX2 edge device for two benchmark tasks: Drone-based Pedestrian Detection and Automatic License Plate Recognition. In a bandwidth-constrained wireless environment, RA-MOSAIC employs a batch size of 1 to pack up to 6 concurrent video streams on a dynamically sized canvas frame, providing on average (i) a 14.3% gain in object detection accuracy and (ii) an 11.11% gain in throughput (up to 20 FPS per camera, cumulatively 120 FPS) over our previous work MOSAIC, a naïve canvas-based baseline. Compared to prior state-of-the-art baselines such as batched inference over extracted RoIs, RA-MOSAIC provides a 29.6% gain in accuracy at comparable throughput. Similarly, RA-MOSAIC outperforms bandwidth-adaptive baselines such as FCFS (\(\leq 1\%\) accuracy gain but a \(5.6\times\), or 566.67%, throughput gain) and uniform grid packing (17% accuracy improvement and 5% throughput gain).
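To make the idea of packing RoI tiles onto a dynamically sized canvas concrete, the following is a minimal sketch only. It is not RA-MOSAIC's workload-adaptive bin-packing strategy, which the abstract does not specify; a simple priority-ordered shelf heuristic is substituted for illustration. The `Tile` fields, the `priority` score, the function names, and the candidate canvas sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    """Hypothetical RoI tile cropped from one camera stream."""
    stream_id: int
    width: int
    height: int
    priority: float  # assumed criticality score; higher is more important


def pack_tiles(tiles, canvas_w, canvas_h):
    """Shelf packing: place tiles left to right in rows ("shelves").

    Returns ({tile: (x, y)}, [tiles that did not fit]).
    """
    placed, dropped = {}, []
    x = y = shelf_h = 0
    # Most critical tiles first, so they claim canvas space first.
    for t in sorted(tiles, key=lambda t: (-t.priority, -t.height)):
        nx, ny, nh = x, y, shelf_h
        if nx + t.width > canvas_w:
            # Tile does not fit on the current shelf: try a new one below.
            nx, ny, nh = 0, y + shelf_h, 0
        if t.width > canvas_w or ny + t.height > canvas_h:
            dropped.append(t)  # no room; leave packing state unchanged
            continue
        placed[t] = (nx, ny)
        x, y, shelf_h = nx + t.width, ny, max(nh, t.height)
    return placed, dropped


def choose_canvas(tiles, sizes):
    """Pick the smallest candidate canvas (sizes sorted ascending)
    that holds every tile; fall back to the largest otherwise."""
    for w, h in sizes:
        _, dropped = pack_tiles(tiles, w, h)
        if not dropped:
            return (w, h)
    return sizes[-1]


tiles = [
    Tile(stream_id=0, width=64, height=64, priority=0.9),
    Tile(stream_id=1, width=64, height=32, priority=0.5),
    Tile(stream_id=2, width=128, height=64, priority=0.7),
]
placed, dropped = pack_tiles(tiles, 128, 128)
canvas = choose_canvas(tiles, [(128, 128), (256, 256)])
```

In this sketch a smaller canvas is chosen whenever the current set of tiles fits, which mirrors the intuition that a lighter aggregate workload leaves room for higher throughput. A real system would also need to weigh tile scaling and inference cost, which are not modeled here.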