Performance-centric register file design for GPUs using racetrack memory

Shuo Wang, Xiaolong Xie, Yongpan Liu, Yun Liang, Xiuhong Li, Guangyu Sun, Chao Zhang, Yu Wang

doi:10.1109/aspdac.2016.7427984

Abstract

The key to high performance on GPU architectures lies in massive threading, which drives the large number of cores and allows the execution of different threads to overlap. In practice, however, the number of threads that can execute simultaneously is often limited by the size of the GPU register file. The traditional SRAM-based register file occupies so much chip area that it cannot scale to meet the growing demand for massive threading in GPU applications. Racetrack memory is a promising technology for building large-capacity GPU register files because of its high storage density. Without careful placement of registers, however, the lengthy shift operations of racetrack memory can hurt performance. In this paper, we explore racetrack memory for designing a high-performance register file for GPU architectures. The high storage density of racetrack memory improves thread-level parallelism, i.e., the number of threads that execute simultaneously. However, if the bits of a register are not aligned with the access ports, shift operations are required to move those bits to the ports. To mitigate this shift overhead, we develop a register file preshifting strategy and a compile-time managed register mapping algorithm. Experimental results demonstrate that our technique improves performance by up to 24% (19% on average) across a variety of GPU applications.
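The shift overhead described in the abstract can be illustrated with a toy cost model. The sketch below is not the authors' simulator or mapping algorithm; the class name `Racetrack`, the port layout, and the one-shift-per-domain cost are all assumptions made for illustration. It models a single track whose domains slide past fixed access ports and counts how many shift steps each access needs.

```python
class Racetrack:
    """Toy model of one racetrack-memory track (hypothetical, for illustration)."""

    def __init__(self, num_domains, ports):
        self.num_domains = num_domains   # number of storage domains on the track
        self.ports = list(ports)         # fixed physical positions of access ports
        self.offset = 0                  # how far the track is currently shifted
        self.total_shifts = 0            # accumulated shift steps

    def access(self, domain):
        """Align `domain` with the cheapest port and return the shift steps used."""
        if not 0 <= domain < self.num_domains:
            raise IndexError("domain out of range")
        # Track offset that would place `domain` under each port.
        candidates = [port - domain for port in self.ports]
        # Pick the offset closest to the current one (fewest shift steps).
        target = min(candidates, key=lambda off: abs(off - self.offset))
        cost = abs(target - self.offset)
        self.offset = target
        self.total_shifts += cost
        return cost


track = Racetrack(num_domains=8, ports=[0, 4])
costs = [track.access(d) for d in (0, 3, 7)]
```

In this model, a domain that already sits under a port costs zero shifts, while the accesses to domains 3 and 7 cost 1 and 4 shifts respectively. A preshifting strategy or a register mapping that keeps consecutively accessed registers close to a port would aim to reduce or hide such costs.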

Full Text