HeteroFlow

Shaojie Xiang, Yi-Hsiang Lai, Debjit Pal, Zhiru Zhang, Niansong Zhang, Hongzheng Chen, Yuan Zhou

doi:10.1145/3490422.3502369

Abstract

To achieve high performance with FPGA-equipped heterogeneous compute systems, it is crucial to co-optimize data placement and compute scheduling to maximize data reuse and bandwidth utilization for both on- and off-chip memory accesses. However, optimizing the data placement for FPGA accelerators is a complex task. One must acquire in-depth knowledge of the target FPGA device and its associated memory system in order to apply a set of advanced optimizations. Even with the latest high-level synthesis (HLS) tools, programmers often have to insert many low-level vendor-specific pragmas and substantially restructure the algorithmic code so that the right data are accessed at the right loop level using the right communication schemes. These code changes can significantly compromise the composability and portability of the original program. To address these challenges, we propose HeteroFlow, an FPGA accelerator programming model that decouples the algorithm specification from optimizations related to orchestrating the placement of data across a customized memory hierarchy. Specifically, we introduce a new primitive named .to(), which provides a unified programming interface for specifying data placement optimizations at different levels of granularity: (1) coarse-grained data placement between host and accelerator, (2) medium-grained kernel-level data placement within an accelerator, and (3) fine-grained data placement within a kernel. We build HeteroFlow on top of the open-source HeteroCL DSL and compilation framework. Experimental results on a set of realistic benchmarks show that programs written in HeteroFlow can match the performance of extensively optimized manual HLS designs with far fewer lines of code.
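
The three placement granularities named in the abstract can be pictured with a small toy model. The sketch below is not the HeteroFlow or HeteroCL API: the names `Schedule`, `Device`, `Kernel`, `LoopAxis`, and `Granularity` are hypothetical, and dispatching on the destination's type is only an assumption about how a single `.to()` entry point might distinguish the three levels. It records placement decisions separately from any algorithm code, which is the decoupling the abstract describes.

```python
from dataclasses import dataclass, field
from enum import Enum


class Granularity(Enum):
    """The three levels of data placement listed in the abstract."""
    HOST_ACCELERATOR = "coarse"  # between host and accelerator
    INTER_KERNEL = "medium"      # between kernels within an accelerator
    INTRA_KERNEL = "fine"        # within a single kernel


# Hypothetical destination types; none of these are real HeteroFlow classes.
@dataclass(frozen=True)
class Device:
    name: str


@dataclass(frozen=True)
class Kernel:
    name: str


@dataclass(frozen=True)
class LoopAxis:
    kernel: Kernel
    name: str


@dataclass
class Schedule:
    """Toy schedule that only records where each tensor should be placed."""
    placements: list = field(default_factory=list)

    def to(self, tensor: str, dst) -> Granularity:
        # Assumption: the destination's type selects the granularity.
        if isinstance(dst, Device):
            level = Granularity.HOST_ACCELERATOR
        elif isinstance(dst, Kernel):
            level = Granularity.INTER_KERNEL
        elif isinstance(dst, LoopAxis):
            level = Granularity.INTRA_KERNEL
        else:
            raise TypeError(f"unsupported destination: {dst!r}")
        self.placements.append((tensor, dst, level))
        return level


# Usage: placement decisions live in the schedule, not in the algorithm.
conv = Kernel("conv")
pool = Kernel("pool")
s = Schedule()
s.to("image", Device("xcel"))         # (1) host to accelerator
s.to("feature_map", pool)             # (2) kernel to kernel
s.to("weights", LoopAxis(conv, "y"))  # (3) placement at a loop level
```

Because the toy schedule only accumulates records, changing a placement means editing one `.to()` call rather than restructuring the algorithm, which mirrors the portability argument made above.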

Full Text