Abstract
Systolic array architecture is widely used in spatial hardware and is well suited to many tensor processing algorithms. Many systolic array architectures are implemented with a high-level synthesis (HLS) design flow. However, existing HLS tools do not favor modular and reusable design, which makes design iteration inefficient. In this article, we analyze the systolic array design space and identify the structures common to different systolic dataflows. We build hardware module templates using the Chisel infrastructure, which can be reused across different dataflows and computation algorithms. This remarkably improves productivity in the development and optimization of systolic architectures. We further build a systolic array generator that transforms a tensor algorithm definition into a complete systolic hardware architecture. Experiments show that we can implement systolic array designs for different applications and dataflows with little engineering effort, and that their throughput exceeds that of HLS designs.
Highlights
Systolic array architecture is widely used in spatial hardware and well-suited for many tensor processing algorithms
Experimental setup: We evaluate the performance and programming efficiency of our systolic generator on GEMM and other tensor applications, and compare the GEMM results with several existing high-level synthesis (HLS)-based works [3, 7, 8]. The systolic array designs are synthesized and implemented on a Xilinx VU9P FPGA platform with Xilinx Vivado 2018.2
Our implementation uses Chisel's ready-valid interface for data communication. This avoids both the unified HLS programming interface, which introduces extra data dependences, and the complex finite state machine generated by the HLS compiler
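The ready-valid handshake mentioned above can be illustrated with a small cycle-level model. This is only a behavioral sketch in Python, not the article's Chisel code: the function name `simulate_ready_valid`, the repeating stall patterns, and the `max_cycles` guard are all assumptions made for illustration. The one rule it models is that a data item moves in a given cycle only when the producer asserts valid and the consumer asserts ready in that same cycle.

```python
def simulate_ready_valid(items, valid_pattern, ready_pattern, max_cycles=1000):
    """Cycle-level model of a ready-valid channel (illustrative sketch).

    valid_pattern / ready_pattern are repeating lists of booleans that say,
    per cycle, whether the producer offers data and whether the consumer
    can accept it. Both names are hypothetical, not from the article.
    """
    pending = list(items)   # data the producer still has to send
    received = []           # data the consumer has accepted
    cycle = 0
    while pending and cycle < max_cycles:
        valid = valid_pattern[cycle % len(valid_pattern)]
        ready = ready_pattern[cycle % len(ready_pattern)]
        # A transfer "fires" only when valid and ready are high together.
        if valid and ready:
            received.append(pending.pop(0))
        cycle += 1
    return received, cycle


if __name__ == "__main__":
    # The consumer stalls every other cycle; no data is lost or reordered.
    data, cycles = simulate_ready_valid([1, 2, 3], [True], [True, False])
    print(data, cycles)
```

Because each side only looks at its local valid and ready signals, a stalled consumer simply delays the transfer without requiring a global controller, which is the property that makes this style of interface convenient for composing modules.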
Summary
Abstract: Systolic array architecture is widely used in spatial hardware and is well suited to many tensor processing algorithms. We build hardware module templates using the Chisel infrastructure, which can be reused across different dataflows and computation algorithms. This remarkably improves productivity in the development and optimization of systolic architectures. Experiments show that we can implement systolic array designs for different applications and dataflows with little engineering effort, and that their throughput exceeds that of HLS designs.
Tensor algebra is a prevalent tool in modern computer applications and is increasingly deployed on various embedded devices. Systolic array architecture, which features high computation parallelism and data reusability through an array of processing elements (PEs), is widely adopted in accelerator designs. Systolic architectures are also used in many other applications, such as convolution, FFT, and matrix decomposition
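To make the idea of parallelism and data reuse across an array of PEs concrete, the following is a minimal software model of one common systolic dataflow for GEMM, the output-stationary one. It is a sketch under stated assumptions, not the generator described in the article: the function name `systolic_matmul` and the register layout are hypothetical. Each PE(i, j) keeps its own accumulator for C[i][j], values of A move one PE to the right per cycle, values of B move one PE down per cycle, and the inputs are skewed so that matching operands meet in the right PE.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing C = A x B.

    a is an M x K list of lists, b is a K x N list of lists.
    Returns (C, number_of_cycles). Illustrative model only.
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    acc = [[0] * n for _ in range(m)]    # stationary partial sums, one per PE
    a_reg = [[0] * n for _ in range(m)]  # A operand held by each PE last cycle
    b_reg = [[0] * n for _ in range(m)]  # B operand held by each PE last cycle
    cycles = m + n + k - 2               # cycles until the last PE finishes
    for t in range(cycles):
        new_a = [[0] * n for _ in range(m)]
        new_b = [[0] * n for _ in range(m)]
        for i in range(m):
            for j in range(n):
                # A enters at the left edge, with row i delayed by i cycles;
                # elsewhere it comes from the left neighbour's register.
                if j == 0:
                    idx = t - i
                    a_in = a[i][idx] if 0 <= idx < k else 0
                else:
                    a_in = a_reg[i][j - 1]
                # B enters at the top edge, with column j delayed by j cycles;
                # elsewhere it comes from the upper neighbour's register.
                if i == 0:
                    idx = t - j
                    b_in = b[idx][j] if 0 <= idx < k else 0
                else:
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in  # multiply-accumulate inside the PE
                new_a[i][j] = a_in        # forward operands on the next cycle
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b
    return acc, cycles


if __name__ == "__main__":
    c, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(c, cycles)
```

In this model every element of A is read from memory once and then reused by all N PEs in its row, and every element of B is reused by all M PEs in its column, which is the data reuse the text refers to. Other dataflows keep a different operand stationary but share the same neighbour-to-neighbour structure.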