In the AI era, the emergence of the Transformer model has driven a significant shift in natural language processing. Its derivative, the Vision Transformer (ViT), adapts these principles to image recognition and demonstrates performance superior to that of traditional Convolutional Neural Networks (CNNs). Despite this excellent performance, deploying ViT models on edge devices is impeded by their extensive computational demands and large memory footprints, which conflict with the limited resources and real-time processing requirements at the edge. Hence, dedicated hardware accelerators are needed to optimize the ViT architecture. This paper reviews the development of Field-Programmable Gate Array (FPGA)-based ViT inference accelerators, focusing on their architectures and their role in overcoming the challenges of ViT model deployment. It also explores optimization approaches at both the algorithm and hardware levels and traces the advancements in deploying AI models at the edge using FPGAs. It summarizes current trends in research on FPGA-based ViT accelerators and offers insights into future directions for hardware-accelerated AI. Overall, by surveying related work on FPGA-based ViT inference accelerator optimization, this article presents a useful snapshot of current research on ViT hardware accelerators and helps clarify future research directions in this area.