Collaborative Intelligence For Vision Transformers: A Token Sparsity-Driven Edge-Cloud Framework
Collaborative Intelligence (CI) streamlines DNN deployment on edge-cloud infrastructure by optimizing how the workload is split between the edge and the cloud. It leverages data sparsity both to compress the data transmitted to the cloud and to reduce the overall computational cost of the system. Despite the rising popularity of the Vision Transformer (ViT), its computational overhead, which is higher than that of CNNs, poses challenges for edge-cloud deployment. Existing CI methods favor CNNs and rely on feature map sparsity. ViTs, in contrast, exhibit token-based sparsity, which complicates the direct application of CNN-optimized CI methods. This motivates us to propose a novel CI approach that exploits the token sparsity of ViT. We propose an offloading policy network that, before inference, computes scores reflecting each token's relevance to the task. These scores allow the offloaded data to consist of sparse tokens, which improves both the compression rate and the computational cost of the edge-cloud system. Our method reduces the computational cost of ViT by 41.98-45.75% while keeping accuracy degradation within 1.96-3.10 points and achieving a compression rate of up to 36.85%.
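To make the idea of score-driven token sparsity concrete, the following is a minimal sketch, not the policy network proposed here. It assumes a hypothetical single linear layer with a sigmoid as the scorer, random token embeddings, and an arbitrary keep ratio of 0.6; the names `score_tokens`, `select_tokens`, and `keep_ratio` are illustrative only. The reported payload fraction is simply the share of tokens kept, which need not match the compression rate as defined in the paper.

```python
import math
import random


def score_tokens(tokens, weights, bias=0.0):
    """Assumed stand-in scorer: one linear layer followed by a sigmoid.

    Returns one relevance score in (0, 1) per token.
    """
    scores = []
    for tok in tokens:
        logit = sum(x * w for x, w in zip(tok, weights)) + bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


def select_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order."""
    k = max(1, round(keep_ratio * len(tokens)))
    # Rank token indices from most to least relevant.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay aligned with the image.
    keep_idx = sorted(ranked[:k])
    return [tokens[i] for i in keep_idx], keep_idx


random.seed(0)
NUM_TOKENS, DIM = 196, 16  # assumed: 14x14 patches, toy embedding size
tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]
weights = [random.gauss(0, 1) / math.sqrt(DIM) for _ in range(DIM)]

scores = score_tokens(tokens, weights)
kept, keep_idx = select_tokens(tokens, scores, keep_ratio=0.6)

# Only the kept tokens (plus their indices) would be offloaded to the cloud.
payload_fraction = len(kept) / len(tokens)
print(f"kept {len(kept)} of {len(tokens)} tokens "
      f"({payload_fraction:.1%} of the original token count)")
```

In this sketch the scoring happens once, before any transformer block runs, so the dropped tokens cost neither edge computation nor transmission bandwidth. A trained scorer and a task-dependent keep ratio would replace the random weights and the fixed 0.6 used above.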