Demystifying the Memory System of Modern Datacenter FPGAs for Software Programmers through Microbenchmarking

Alec Lu,Lesley Shannon,Zhenman Fang,Weihua Liu

doi:10.1145/3431920.3439284

Abstract

With the public availability of FPGAs from major cloud service providers like AWS, Alibaba, and Nimbix, hardware and software developers can now easily access FPGA platforms. However, it is nontrivial to develop efficient FPGA accelerators, especially for software programmers who use high-level synthesis (HLS). The major goal of this paper is to figure out how to efficiently access the memory system of modern datacenter FPGAs in HLS-based accelerator designs. This is especially important for memory-bound applications; for example, a naive accelerator design only utilizes less than 5% of the available off-chip memory bandwidth. To achieve our goal, we first identify a comprehensive set of factors that affect the memory bandwidth, including 1) the number of concurrent memory access ports, 2) the data width of each port, 3) the maximum burst access length for each port, and 4) the size of consecutive data accesses. Then we carefully design a set of HLS-based microbenchmarks to quantitatively evaluate the performance of the Xilinx Alveo U200 and U280 FPGA memory systems when changing those affecting factors, and provide insights into efficient memory access in HLS-based accelerator designs. To demonstrate the usefulness of our insights, we also conduct two case studies to accelerate the widely used K-nearest neighbors (KNN) and sparse matrix-vector multiplication (SpMV) algorithms. Compared to the baseline designs, optimized designs leveraging our insights achieve about 3.5x and 8.5x speedups for the KNN and SpMV accelerators.

Full Text