A Monolithic 3D Hybrid Architecture for Energy-Efficient Computation

Ye Yu,Niraj K Jha

doi:10.1109/tmscs.2018.2882433

Ye Yu, Niraj K Jha

Open Access

https://doi.org/10.1109/tmscs.2018.2882433

Copy DOI

Abstract

The exponentially increasing performance of chip multiprocessors (CMPs) predicted by Moore's Law is no longer due to the increasing clock rate of a single CPU core, but on account of the increase of core counts in the CMP. More transistors are integrated within the same footprint area as the technology node shrinks to deliver higher performance. However, this is accompanied by higher power dissipation that usually exceeds the coping capability of inexpensive cooling techniques. This Power Wall prevents the chip from running at full speed with all the devices powered-on. This is known as the dark silicon problem. Another major bottleneck in CMP development is the imbalance between the CPU clock rate and memory access speed. This Memory Wall keeps the CPU from fully utilizing its compute power. To address both the Power and Memory Walls, we propose a monolithic 3D hybrid architecture that consists of a multi-core CPU tier, a fine-grain dynamically reconfigurable (FDR) field-programmable gate array (FPGA) tier, and multiple resistive RAM (RRAM) tiers. The FDR tier is used as an accelerator. It uses the concept of temporal logic folding to localize on-chip communication. The RRAM tiers are connected to the CPU and FDR tiers through an efficient memory interface that takes advantage of the tremendous bandwidth available from monolithic inter-tier vias and hides the latency of large data transfers. We evaluate the architecture on two types of benchmarks: compute-intensive and memory-intensive. We show that the architecture reduces both power and energy significantly at a better performance for both types of applications. Compared to the baseline, our architecture achieves an average of 43.1× and 2.5× speedup on compute-intensive and memory-intensive benchmarks, respectively. The power and energy consumption are reduced by 5.0× and 40.5×, respectively, for compute-intensive applications, and 2.0× and 4.2×, respectively, for memory-intensive applications. This translates to 1745.3× energy-delay product (EDP) improvement for compute-intensive applications and 10.5× for memory-intensive applications.

Full Text