The use of modern GPUs has been extended from traditional 3D graphics processing to accelerating many scientific, engineering, and enterprise applications. In modern GPUs, on-chip memory capacity keeps increasing to support thousands of chip-resident threads. For example, a large register file is needed to efficiently process highly parallel threads in single-instruction, multiple-thread (SIMT) fashion, and a large shared memory is often implemented to allow data sharing among the threads on the chip. The on-chip memory capacity of GPUs, however, is highly constrained by the large memory cell area and high static power consumption of conventional SRAM implementations. In this work, we propose to utilize the emerging multi-level cell (MLC) spin-transfer torque RAM (STT-RAM) technology to implement the register file and shared memory in GPUs. Compared to SRAM, MLC STT-RAM (or MLC-STT) has a much smaller cell area as well as ultra-low standby power, thanks to the non-volatility of MLC-STT technology. Hence, the footprint and leakage power of the implemented memory components are substantially reduced. Moreover, in light of the asymmetric performance of the soft and hard bits of an MLC-STT cell, we propose a dynamic data remapping strategy for the register file and shared memory implementations that allows a flexible tradeoff between memory access time and available capacity: frequently accessed data is always mapped to the fast rows built with the soft bits of the MLC-STT cells, while the slow rows composed of the hard bits are used only when a larger capacity is critically needed. We also develop a novel rescheduling scheme that reorders the issued warps to minimize the time they wait to access register banks in the register file, a delay induced by long writeback operations. Finally, an early-termination technique is applied to the shared memory to save write energy when the bits being written do not flip.
Experimental results on benchmarks from ISPASS2009, Rodinia, Parboil, and CUDA show that, on average, the MLC-STT register file achieves a 3.28% system performance improvement, a 9.48% energy reduction, and a 38.9% energy efficiency improvement compared to a conventional SRAM-based design. Meanwhile, the MLC-STT shared memory leads to a 3.45% system performance improvement, a 49.3% energy reduction, and a 116% energy efficiency improvement.
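The remapping idea summarized above (hot data in the fast soft-bit rows, hard-bit rows used only when capacity runs out) can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the hardware policy used in the work: the function name `assign_rows`, the per-register access counters, and the `fast_capacity` parameter are all hypothetical, and the abstract does not describe how access frequency is actually tracked.

```python
from typing import Dict


def assign_rows(access_counts: Dict[str, int], fast_capacity: int) -> Dict[str, str]:
    """Toy model: place the most frequently accessed entries in fast rows.

    access_counts: hypothetical per-entry access counters (entry name -> count).
    fast_capacity: hypothetical number of entries that fit in the fast
                   (soft-bit) rows; anything beyond spills to slow rows.
    """
    # Rank entries from most to least accessed; break ties by name so the
    # result is deterministic.
    ranked = sorted(access_counts, key=lambda name: (-access_counts[name], name))

    placement = {}
    for position, name in enumerate(ranked):
        # Slow (hard-bit) rows are used only once the fast rows are full.
        placement[name] = "fast" if position < fast_capacity else "slow"
    return placement


if __name__ == "__main__":
    counts = {"r0": 90, "r1": 5, "r2": 40, "r3": 1}
    for register, row in sorted(assign_rows(counts, fast_capacity=2).items()):
        print(register, row)
```

When the working set fits within `fast_capacity`, every entry stays in a fast row and the slow rows are never touched, which mirrors the capacity-versus-access-time tradeoff described in the abstract.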