BRLoop: Constructing balanced retimed loop to architect STT-RAM-based hybrid cache for VLIW processors

Keni Qiu, Yujie Zhu, Yuanchao Xu, Qirun Huo, Chun Jason Xue

doi:10.1016/j.mejo.2018.11.011

Abstract

Spin Torque Transfer RAM (STT-RAM), an emerging non-volatile memory technology, has been proposed as a replacement for SRAM-based caches. Its commercialization has recently been accelerated by large companies such as Samsung. Although STT-RAM has several advantages, such as non-volatility, high density and extremely low leakage power consumption, it suffers from high dynamic energy and long latency on write operations. To address this problem, researchers have proposed an STT-RAM/SRAM hybrid structure that alleviates the cost of write operations. In hybrid caches, a migration-based technique is often adopted to exploit the advantages of both parts by dynamically moving write-intensive and read-intensive data between STT-RAM and SRAM. However, migrations themselves introduce extra reads and writes during data movement. For stencil loops with read and write data dependencies, it is observed that the migration overhead is significant and that migrations closely correlate with the interleaved read and write memory access pattern within a memory block. A loop retiming technique has been proposed to reduce the migration overhead by changing this interleaved memory access pattern. Loop retiming has also been extensively studied as a way to maximize the instruction-level parallelism (ILP) of multiple function units by rearranging the dependence delays in a uniform loop. Both retiming techniques work by changing the instruction dependence delays in a loop. However, the earlier ILP-aware loop retiming is unaware of its impact on the hybrid cache's migrations, while the recent migration-aware loop retiming does not fully consider the parallelism of arithmetic and logical units (ALUs) in VLIW processors. The impacts of retiming on both the migration overhead of the hybrid cache and the ILP of the VLIW processor should therefore be considered together when architecting an STT-RAM-based hybrid cache for VLIW processors.
To address this issue, this paper models the impacts of loop retiming on both the ILP of the ALUs and the migration overhead in an STT-RAM/SRAM hybrid cache. An overall balanced loop retiming solution, considering both the ALU part and the memory part, is devised to achieve high performance for VLIW processors. Experimental results across a set of benchmarks show that the proposed optimal and heuristic balanced retiming approaches effectively improve overall system performance compared with no retiming, pure migration-aware retiming, and pure ILP-aware retiming.
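
As a rough illustration of what "changing the instruction dependence delays in a loop" means, the following minimal Python sketch applies a retiming function to a small loop dependence graph. It is not the paper's balanced formulation and models neither the hybrid cache nor migrations. The three-node graph, the unit execution time per node, the helper names (`retime`, `is_legal`, `cycle_period`), and the sign convention d_r(u -> v) = d(u -> v) + r(u) - r(v) are all assumptions chosen for the example.

```python
from collections import defaultdict

def retime(edges, r):
    """Return retimed delays, assuming d_r(u -> v) = d(u -> v) + r(u) - r(v)."""
    return {(u, v): d + r.get(u, 0) - r.get(v, 0) for (u, v), d in edges.items()}

def is_legal(edges):
    """A retiming is legal only if no edge ends up with a negative delay."""
    return all(d >= 0 for d in edges.values())

def cycle_period(nodes, edges):
    """Longest chain of zero-delay edges, counting one time unit per node.

    Assumes the zero-delay subgraph is acyclic, which holds when every
    cycle in the loop carries at least one delay.
    """
    succ = defaultdict(list)
    for (u, v), d in edges.items():
        if d == 0:
            succ[u].append(v)
    memo = {}

    def longest(u):
        if u not in memo:
            memo[u] = 1 + max((longest(v) for v in succ[u]), default=0)
        return memo[u]

    return max(longest(u) for u in nodes)

# Hypothetical loop body: A -> B -> C within one iteration,
# and C feeds A two iterations later (delay 2).
nodes = ["A", "B", "C"]
edges = {("A", "B"): 0, ("B", "C"): 0, ("C", "A"): 2}

# Retiming A by 1 moves one delay from A's incoming edge to its outgoing edge.
retimed = retime(edges, {"A": 1})

before = cycle_period(nodes, edges)    # A, B, C must run back to back
after = cycle_period(nodes, retimed)   # only B -> C remains a zero-delay chain
```

In this toy graph the total delay around the cycle stays at 2, but the longest zero-delay chain shrinks from three nodes to two, which is the kind of freedom an ILP-aware retiming exploits. A migration-aware retiming would use the same delay-shifting mechanism with a different objective.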

Full Text