A Scalable Many-core Overlay Architecture on an HBM2-enabled Multi-Die FPGA

Riadh Ben Abdelhamid,Yoshiki Yamaguchi,Taisuke Boku

doi:10.1145/3547657

Abstract

The overlay architecture enables to raise the abstraction level of hardware design and enhances hardware-accelerated applications’ portability. In FPGAs, there is a growing awareness of the overlay structure as typified by many-core architecture. It works in theory; however, it is difficult in practice, because it is beset with serious design issues. For example, the size of FPGAs is bigger than before. It is exacerbating the issue of the place-and-route. Besides, a single FPGA is actually the sum of small-to-middle FPGAs by advancing packaging technology like silicon interposers. Thus, the tightly coupled many-core designs will face this covert issue that the wires among the regions are extremely restricted. This article proposes efficient essential processing elements, micro-architecture design, and the interconnect architecture toward a scalable many-core overlay design. In particular, our work proposes a novel compact buffering technique to reduce memory resource utilization in tightly connected overlays while preserving computational efficiency. This technique reduces the utilization of BlockRAM to nearly 50% while achieving a best-case computational efficiency of 91.93% in a three-dimensional Jacobi benchmark. Besides, the proposed enhancements led to around 2× and 3× improvement in performance and power efficiency, respectively. Moreover, the improved scalability allowed increasing compute resources and delivering around 4× better performance and power efficiency, as compared to the baseline Dynamically Re-programmable Architecture of Gather-scatter Overlay Nodes overlay.

Full Text