Abstract

The attention mechanism has recently shown superior performance in natural language processing and computer vision tasks. However, its complex dataflow and large-scale matrix computations impose heavy compute and memory overheads, posing a great challenge for hardware accelerator design. Moreover, previous solutions that benefit from matrix partitioning remain bounded by the softmax function. In this paper, we propose a new attention framework that dramatically improves attention model inference performance for long-sequence tasks on FPGAs. We design a novel accelerator architecture that employs two systolic arrays and a ping-pong structure to accelerate the attention computation. We also propose an analytical model that predicts resource usage and performance, enabling fast design space exploration. Experiments with the state-of-the-art BERT model demonstrate that the design achieves 4.61× and 1.24× improvements in speed and energy efficiency compared to CPU and GPU on the Xilinx XCZU11EG platform.
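To illustrate why the softmax function bounds partitioned attention, the following is a minimal sketch of scaled dot-product attention in plain Python (not the paper's accelerator design): the softmax normalization of each score row depends on the row's global maximum and sum, so no output element can be finalized until the entire row of Q·Kᵀ scores is available, which limits tile-wise matrix partitioning.

```python
import math

def softmax(row):
    # Numerically stable softmax. Note the GLOBAL dependency: the max and
    # the sum over the whole row are needed before any single element can
    # be normalized -- this is the barrier for tile-wise partitioning.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V, d):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V,
    # written with plain lists for clarity.
    scores = [[sum(q[i] * k[i] for i in range(d)) / math.sqrt(d) for k in K]
              for q in Q]
    probs = [softmax(row) for row in scores]
    return [[sum(p[j] * V[j][i] for j in range(len(V)))
             for i in range(len(V[0]))] for p in probs]
```

For a query matching the first key, the first attention weight dominates, but computing it still required every score in the row, which is the dependency the proposed framework works around.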
