The TVM-VTA stack is a reconfigurable, hardware/software co-designed tensor accelerator that combines an operator-level optimizing compiler with accelerator back-ends. However, the VTA architecture, which is generated through HLS compilation, does not make full use of the available hardware resources and leaves considerable room for optimization. Specifically, the hardware throughput does not make good use of the memory bandwidth, which creates a performance bottleneck. We therefore propose an Enhanced VTA with parallel channels, built from RTL-HLS hybrid templates and compatible with the full-stack framework. The VTA memory access microarchitecture is redesigned and optimized around the resources of the hardware platform, so that feature map and weight data are loaded in parallel and the bandwidth is fully used. We build the software and hardware environment on a Xilinx ZCU104 development board and deploy the YOLOV3-Tiny and YOLOV3 networks. The peak computing power reaches 361 GOP/s at a frequency of 200 MHz, which is 99.64x that of the original VTA on the PYNQ Z1 platform. The performance of YOLOV3-Tiny is the highest among the results publicly reported in the TVM community. The overall performance of YOLOV3 on TVM-VTA is reported for the first time; in normalized terms, it is a 2.2x speedup over NVDLA. The design compares favorably in speedup and power efficiency with other full-stack accelerator designs.
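As a rough sanity check on the headline figures, the short Python sketch below derives what the reported peak rate implies per clock cycle and what baseline rate the 99.64x ratio implies. The helper names are hypothetical, and the convention of counting one multiply-accumulate as two operations is an assumption, since the abstract does not state how operations are counted.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# Function names are illustrative and not part of TVM or VTA.

def ops_per_cycle(peak_gops: float, freq_mhz: float) -> float:
    """Operations per clock cycle implied by a peak rate and a clock frequency."""
    return (peak_gops * 1e9) / (freq_mhz * 1e6)


def implied_baseline_gops(peak_gops: float, speedup: float) -> float:
    """Baseline peak rate implied by a reported speedup ratio."""
    return peak_gops / speedup


PEAK_GOPS = 361.0   # reported peak computing power, GOP/s
FREQ_MHZ = 200.0    # reported clock frequency, MHz
SPEEDUP = 99.64     # reported ratio over the original VTA on PYNQ Z1

per_cycle = ops_per_cycle(PEAK_GOPS, FREQ_MHZ)

# Assumption: 1 multiply-accumulate (MAC) = 2 operations.
macs_per_cycle = per_cycle / 2

baseline = implied_baseline_gops(PEAK_GOPS, SPEEDUP)

print(f"operations per cycle: {per_cycle:.1f}")
print(f"MACs per cycle (assuming 2 ops per MAC): {macs_per_cycle:.1f}")
print(f"implied baseline peak: {baseline:.2f} GOP/s")
```

Under these assumptions the design would sustain roughly 1800 operations per cycle at peak, and the original VTA figure used for comparison would be a few GOP/s. The abstract gives neither number directly, so both are derived estimates rather than reported results.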