In the current chip-multiprocessor era, 3D-stacked DRAM has become an attractive means of mitigating the DRAM bandwidth wall. In a chip-multiprocessor, the 3D-stacked DRAM is architected either (a) to cache both local and remote data or (b) to cache only local data. Caching only local data in the 3D-stacked DRAM forces the chip-multiprocessor to incur inter-node latency overhead when accessing remote data; caching both local and remote data, however, requires a large coherence directory (tens of MBs) to ensure correctness. In this paper, we consider a 3D-stacked DRAM based chip-multiprocessor and perform a comparative study between (a) high-level adaptive run-time data page mapping onto the DRAM with a small auxiliary SRAM buffer as a performance booster, and (b) the DRAM used as a coherent cache. Our experiments on a 64-core chip-multiprocessor system with 4GB of 3D-stacked DRAM show that our adaptive run-time data page mapping on DRAM, together with the SRAM buffer, outperforms the base case (where the DRAM caches only local data) by an average of 48%. Moreover, our method shows an average performance improvement of 40% over a recent state-of-the-art work (where the DRAM caches both local and remote data).