Abstract

Due to performance and energy requirements, FPGA-based accelerators have become a promising solution for high-performance computations. Meanwhile, with the help of high-level synthesis (HLS) compilers, FPGAs can be programmed using common programming languages such as C, C++, or OpenCL, thereby improving design efficiency and portability. Stencil computations are significant kernels in various scientific applications. In this paper, we introduce an architecture design for implementing stencil kernels on a state-of-the-art FPGA with high bandwidth memory (HBM). Traditional FPGAs are usually equipped with external memory, e.g., DDR3 or DDR4, which limits design space exploration in the spatial domain of stencil kernels. Therefore, many previous studies relied mainly on exploiting parallelism in the temporal domain to overcome the bandwidth limitations. In our approach, we scale up the design performance by considering the spatial and temporal parallelism of the stencil kernel equally. We also discuss the design portability among different HLS compilers. We use typical stencil kernels to evaluate our design on a Xilinx U280 FPGA board and compare the results with existing studies. By adopting our method, developers can choose from a broad range of parallelization strategies, based on the specific FPGA resources available, to improve performance.

Highlights

  • Over the past few years, offloading high-performance computing (HPC) applications to dedicated hardware accelerators has been a widely used solution [1,2]

  • To match the performance of GPGPUs, existing studies mainly rely on the temporal parallelism of the stencil kernel to improve performance, thereby shifting the bottleneck of stencil computations from a memory bandwidth limitation to an FPGA hardware resource limitation

  • Suppose we only use temporal parallelism to achieve the same performance as in 4 × (2M + 1) stencil cells


Summary

Introduction

Over the past few years, offloading high-performance computing (HPC) applications to dedicated hardware accelerators has been a widely used solution [1,2]. To match the performance of GPGPUs, existing studies mainly rely on the temporal parallelism of the stencil kernel to improve performance, thereby shifting the bottleneck of stencil computations from a memory bandwidth limitation to an FPGA hardware resource limitation. Optimization strategies such as building on-chip sliding window buffers, replication and/or vectorization of computing units, and stream processing were discussed in these papers. This relies on the corresponding compiler to automatically partition memory resources to support parallel memory access [14,15], resulting in inefficient utilization of BRAM resources and redundant memory costs when scaling the design performance with temporal parallelism. This limits the design scalability to one spatial dimension [16,17], which misses potential computing optimization opportunities in some stencil kernels.
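The on-chip sliding window buffer mentioned above can be illustrated with a small software model. The following is a minimal sketch in plain C++, not the architecture proposed in the paper: the 5-point stencil shape, the 0.2 coefficient, and the function names `stencil_reference` and `stencil_streamed` are all assumptions chosen for illustration. It shows how a window holding the last 2 × width + 1 cells lets every input cell be read from external memory exactly once, which is the property that makes the technique attractive when memory bandwidth is the bottleneck.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct 5-point (radius-1) stencil used as a reference.
// Boundary cells are copied through unchanged (an assumption of this sketch).
std::vector<float> stencil_reference(const std::vector<float>& in,
                                     std::size_t height, std::size_t width) {
    assert(in.size() == height * width);
    std::vector<float> out(in);
    for (std::size_t y = 1; y + 1 < height; ++y) {
        for (std::size_t x = 1; x + 1 < width; ++x) {
            const std::size_t i = y * width + x;
            out[i] = 0.2f * (in[i] + in[i - 1] + in[i + 1] +
                             in[i - width] + in[i + width]);
        }
    }
    return out;
}

// Streaming version: the input is consumed once, in row-major order.
// A shift-register window keeps the most recent 2*width+1 cells, so all
// five neighbours of the centre cell are available without re-reading `in`.
std::vector<float> stencil_streamed(const std::vector<float>& in,
                                    std::size_t height, std::size_t width) {
    assert(in.size() == height * width);
    std::vector<float> out(in);
    const std::size_t depth = 2 * width + 1;
    std::vector<float> window(depth, 0.0f);  // window[depth - 1] is the newest cell

    for (std::size_t n = 0; n < in.size(); ++n) {
        // Shift the window by one cell and append the new input.
        for (std::size_t k = 0; k + 1 < depth; ++k) window[k] = window[k + 1];
        window[depth - 1] = in[n];

        if (n + 1 < depth) continue;            // window not yet full
        const std::size_t c = n - width;        // flat index of the centre cell
        const std::size_t y = c / width;
        const std::size_t x = c % width;
        if (y == 0 || y + 1 >= height || x == 0 || x + 1 >= width) continue;

        // window[width] is the centre; window[0] and window[2*width] are the
        // cells one row above and one row below it.
        out[c] = 0.2f * (window[width] + window[width - 1] + window[width + 1] +
                         window[0] + window[2 * width]);
    }
    return out;
}
```

In an HLS design the inner shift loop would typically map to on-chip line buffers rather than a software loop, and the buffer depth of 2 × width + 1 is what ties on-chip memory cost to the row length of the grid.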

Stencil Computation
FPGA with HBM Memory
Related Work
Stencil Computation Architecture
Sliding Window Buffer Design Approaches
Scaling along the x Dimension of the Target Stencil Space
Scaling along the y Dimension of the Target Stencil Space
Hybrid Scaling Strategy
Proposed Architecture Overview
HBM Memory Bandwidth Optimization
Performance Model
Limitation
Experiment Setup
Experiment Performance
Conclusions
