Convolution is computationally intensive and demands substantial compute capability. Among hardware platforms, the field-programmable gate array (FPGA) emerges as a promising solution owing to its abundant parallelism and energy efficiency. Moreover, convolution can be implemented with different algorithms, including the conventional, general matrix–matrix multiplication (GEMM), Winograd, and fast Fourier transform (FFT) algorithms, which differ in arithmetic complexity, resource requirements, and other characteristics. Different convolutional neural network (CNN) models have different topologies and structures and thus favor different convolution algorithms. In response, software libraries such as cuDNN provide a variety of computational primitives to support these algorithms. However, supporting such libraries on FPGAs is challenging. First, multiple algorithms can share the FPGA resources spatially as well as temporally, introducing either reconfiguration overhead or resource underutilization. Second, FPGA implementation remains a major challenge for library developers, as it typically requires specialized hardware knowledge. In this article, we propose <monospace xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">FCNNLib</monospace>, an efficient and scalable convolution algorithm library on FPGAs. To coordinate multiple convolution algorithms on an FPGA, we develop three scheduling schemes: 1) spatial; 2) temporal; and 3) hybrid, which exhibit different tradeoffs between latency and throughput. We explore these schedulings by balancing the reconfiguration overhead, resource utilization, and optimization objectives of the CNNs. We further provide efficient and tunable algorithm templates that enable performance tuning through performance and resource models.
To aid users, <monospace xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">FCNNLib</monospace> exposes a set of interfaces that support high-level application design. We demonstrate the usability of <monospace xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">FCNNLib</monospace> with state-of-the-art CNNs. <monospace xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">FCNNLib</monospace> achieves up to <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$44.6\times $ </tex-math></inline-formula> and <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$1.76\times $ </tex-math></inline-formula> higher energy efficiency in various scenarios than software libraries for CPUs and GPUs, respectively.
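As a concrete illustration of the complexity differences mentioned above (this sketch is not part of FCNNLib and uses hypothetical helper names), the well-known Winograd F(2×2, 3×3) transform computes a 2×2 output tile with a 4×4 element-wise product, i.e., 16 multiplications per tile versus 36 for direct 3×3 convolution:

```python
def direct_mults(out_h, out_w, k=3):
    # Direct convolution: one k*k dot product per output pixel.
    return out_h * out_w * k * k

def winograd_f2x2_3x3_mults(out_h, out_w):
    # Winograd F(2x2, 3x3): each 2x2 output tile costs a 4x4
    # element-wise product, i.e., 16 multiplications per tile.
    tiles = (out_h // 2) * (out_w // 2)
    return tiles * 16

if __name__ == "__main__":
    h = w = 56  # a typical CNN feature-map size
    d = direct_mults(h, w)
    wg = winograd_f2x2_3x3_mults(h, w)
    print(d, wg, d / wg)  # Winograd needs 36/16 = 2.25x fewer multiplications
```

This 2.25× reduction in multiplications trades against larger transform logic and higher on-chip buffering, which is one reason different CNN layers and resource budgets favor different algorithms.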