Abstract

Coarse-Grained Reconfigurable Arrays (CGRAs) are a class of reconfigurable architectures that inherit the performance and usability properties of Central Processing Units (CPUs) and the reconfigurability aspects of Field-Programmable Gate Arrays (FPGAs). Historically, CGRAs have been successfully used to accelerate embedded applications, and they are today also being considered to accelerate High-Performance Computing (HPC) applications in future supercomputers. However, embedded systems and supercomputers are two vastly different domains with different applications and constraints, and it is not yet fully understood which CGRA design decisions adequately cater to the HPC market. One such open design decision concerns the interconnect that facilitates intra-CGRA communication. Today, intra-CGRA communication comes in two flavors: using routers closely embedded into the compute units or using discrete routers outside the compute units. The former trades flexibility for a reduction in hardware cost, while the latter offers greater flexibility but is more resource-hungry. In this paper, we aspire to understand which of the two designs best suits the CGRA HPC segment. We extend our previous methodology, which consists of both a parameterized CGRA design and an OpenMP-capable compiler, to accommodate both types of routing designs, including verification tests using RTL simulation. Our results show that the discrete router design can facilitate better use of processing elements (PEs) compared to embedded routers and can achieve up to a 79.27% reduction in unnecessary PE occupancy for an aggressively unrolled stencil kernel on an 18 × 16 CGRA, at an (estimated) hardware resource overhead cost of 6.3×. This reduction in PE occupancy can be used, for example, to exploit instruction-level parallelism (ILP) through even more aggressive unrolling.
