Abstract

Designing efficient hardware for accelerating machine learning (ML) applications is a major challenge. Rapidly changing algorithms and network architectures in this field make FPGA-based designs an attractive solution. However, the generic building blocks available in current FPGAs (ALMs/CLBs, DSP blocks) limit the acceleration that can be achieved. We propose a modification to the current FPGA architecture that specializes FPGAs for ML applications. Specifically, we propose adding hard matrix multiplier blocks (matmuls) into the FPGA fabric. These matmuls are implemented using systolic arrays of MACs (Multiply-And-Accumulate units) and can be connected using programmable direct interconnect between neighboring matmuls to form larger systolic matrix multipliers. We explore various matmul sizes (4x4x4, 8x8x8, 16x16x16, 32x32x32) and various strategies for placing these blocks on the FPGA (clustered, surround, columnar). After studying the tradeoffs between area, frequency, fragmentation, and channel width, we recommend 4x4x4 matmul blocks with columnar placement. Experimental results and analytical evaluation reveal that providing matmuls in an FPGA speeds up state-of-the-art neural networks (Resnet50, GNMT, Transformer, Minigo) by ~2.5x on average, compared to a DSP-heavy FPGA with an equal number of MACs. Therefore, FPGAs with hard matrix multipliers can be used to design faster, more area- (and hence power-) efficient hardware accelerators for ML applications than current FPGAs allow, at the cost of reduced flexibility for other applications. A matmul-heavy FPGA fabric could form part of a larger FPGA whose remainder contains general programmable logic, or fully ML-specific FPGAs with matmuls could be created.
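To make the composition idea concrete, the following is a minimal functional sketch (not the authors' hardware design): it shows how a larger matrix multiplication can be decomposed into fixed-size matmul tiles whose partial products are accumulated, which is how small hard blocks such as the proposed 4x4x4 matmuls would be composed into bigger multipliers. The tile size `T`, the helper names, and the use of NumPy are illustrative assumptions.

```python
import numpy as np

T = 4  # assumed hard-block size: each block computes a 4x4x4 matmul (C_4x4 += A_4x4 @ B_4x4)


def hard_matmul_block(a_tile, b_tile):
    """Stand-in for one TxTxT hard matmul block (a systolic array of T*T MACs)."""
    return a_tile @ b_tile


def tiled_matmul(A, B):
    """Compose TxTxT blocks into a larger multiply by accumulating tile partial products."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % T == 0 and N % T == 0 and K % T == 0
    C = np.zeros((M, N))
    for i in range(0, M, T):
        for j in range(0, N, T):
            for k in range(0, K, T):  # accumulate over the shared (reduction) dimension
                C[i:i+T, j:j+T] += hard_matmul_block(A[i:i+T, k:k+T], B[k:k+T, j:j+T])
    return C


# Quick check: the tiled composition matches a direct matrix multiply.
A = np.random.rand(16, 16)
B = np.random.rand(16, 16)
assert np.allclose(tiled_matmul(A, B), A @ B)
```

In hardware, the inner accumulation and the passing of operands between neighboring tiles would be handled by the programmable direct interconnect between adjacent matmul blocks rather than by software loops; the sketch only illustrates the dataflow decomposition.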
