Accelerating visual neural networks in edge computing environments is crucial for processing image and video data. Visual neural networks, including convolutional neural networks and Vision Transformers, are central to image recognition, video analysis, and object detection. Deploying and accelerating these networks on edge devices can significantly improve data processing speed and efficiency. The large parameter counts, complex computational flows, and many structural variants of Transformer models present both opportunities and challenges. We propose Vis-TOP (Vision Transformer Overlay Processor), an overlay processor designed for all types of Vision Transformer models. Unlike coarse-grained general-purpose accelerators such as GPUs and fine-grained custom designs, Vis-TOP encapsulates the characteristics of Vision Transformers into a three-layer, two-level mapping structure, enabling flexible model switching without modifying the hardware architecture. Within this mapping structure, we also designed a corresponding instruction bundle and hardware architecture. We quantized the Swin Transformer model to 8-bit fixed point (fix_8) and implemented the overlay processor on the ZCU102. In our experiments, the throughput is 1.5 times that of a GPU implementation, and the throughput per DSP is 2.2 to 11.7 times higher than that of existing Transformer accelerators. Overall, the approach meets the resource-consumption and inference-speed requirements of real-time AI. Vis-TOP offers a cost-effective image-processing solution for edge computing on reconfigurable devices, improving computational resource utilization, saving data transfer time and cost, and reducing latency.
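The abstract does not describe the quantization scheme in detail; as a rough illustration only, the sketch below shows one common way to map a float tensor to signed 8-bit fixed point (fix_8) with a chosen number of fractional bits. The function names and the `frac_bits` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np

def quantize_fix8(x: np.ndarray, frac_bits: int) -> np.ndarray:
    """Quantize a float tensor to signed 8-bit fixed point (fix_8).

    frac_bits is the assumed number of fractional bits; the representable
    range is [-2**(7 - frac_bits), (2**7 - 1) / 2**frac_bits].
    """
    scale = 2.0 ** frac_bits
    q = np.round(x * scale)                    # snap to the integer grid
    q = np.clip(q, -128, 127)                  # saturate to the int8 range
    return q.astype(np.int8)

def dequantize_fix8(q: np.ndarray, frac_bits: int) -> np.ndarray:
    """Recover an approximate float tensor from fix_8 values."""
    return q.astype(np.float32) / (2.0 ** frac_bits)

# Toy example: quantize a small weight tensor with 6 fractional bits.
w = np.array([0.731, -0.284, 1.05, -1.9], dtype=np.float32)
qw = quantize_fix8(w, frac_bits=6)
print(qw)                                      # int8 fixed-point codes
print(dequantize_fix8(qw, frac_bits=6))        # reconstructed approximation
```

In a fixed-point scheme like this, the choice of `frac_bits` trades dynamic range against precision; per-layer choices are typical, but how Vis-TOP assigns them is not stated in the abstract.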