Abstract

The attention mechanism has recently shown superior performance in natural language processing and computer vision tasks. However, its complex dataflow and large-scale matrix computations impose heavy compute and memory overheads, posing a great challenge for hardware accelerator design. Moreover, previous solutions that benefit from matrix partitioning remain bounded by the softmax function. In this paper, we propose a new attention framework that dramatically improves attention model inference performance for long-sequence tasks on FPGAs. We design a novel accelerator architecture that employs two systolic arrays and a ping-pong structure to accelerate the attention computation. We also propose an analytical model that predicts resource usage and performance, enabling fast design space exploration. Experiments with the state-of-the-art BERT model demonstrate that the design achieves 4.61× and 1.24× improvements in speed and energy efficiency compared to CPU and GPU on the Xilinx XCZU11EG platform.
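To illustrate why the softmax function bounds partitioned attention, the following is a minimal sketch of scaled dot-product attention in plain Python (not the paper's accelerator design): the softmax normalization of each score row depends on the row's global maximum and sum, so no output element can be finalized until the entire row of Q·Kᵀ scores is available, which limits tile-wise matrix partitioning.

```python
import math

def softmax(row):
    # Numerically stable softmax. Note the GLOBAL dependency: the max and
    # the sum over the whole row are needed before any single element can
    # be normalized -- this is the barrier for tile-wise partitioning.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V, d):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V,
    # written with plain lists for clarity.
    scores = [[sum(q[i] * k[i] for i in range(d)) / math.sqrt(d) for k in K]
              for q in Q]
    probs = [softmax(row) for row in scores]
    return [[sum(p[j] * V[j][i] for j in range(len(V)))
             for i in range(len(V[0]))] for p in probs]
```

For a query matching the first key, the first attention weight dominates, but computing it still required every score in the row, which is the dependency the proposed framework works around.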
