GPGPU-accelerated computing has revolutionized a broad range of applications. To bridge the gap between the ever-growing computing capability and the external memory, on-chip memory is becoming increasingly important to GPGPU performance for general-purpose computing. Inherited from traditional CPUs, however, the contemporary GPGPU on-chip memory design is suboptimal for SIMT (single instruction, multiple threads) execution. In particular, thrashing of the on-chip first-level data (L1D) cache, caused by insufficient capacity and imbalanced set usage, leads to a low hit rate and limits overall performance. In this study, we reform the contemporary on-chip memory design and propose an integrated and balanced on-chip memory (IBOM) architecture for high-performance GPGPUs. IBOM first virtually enlarges the L1D cache through an integrated architecture that exploits the under-utilized register file (RF), with lightweight ISA, compiler, and microarchitecture support. With the enlarged capacity, it then improves cache utilization through a set-balancing technique that exploits under-utilized cache sets. In the proposed IBOM design, register and cache accesses remain compatible with normal pipeline operation with only simple changes. IBOM exploits the size inversion in GPGPU on-chip memory, where the RF capacity exceeds that of the L1D cache, and makes better use of these precious resources for higher performance and energy efficiency even with a smaller total on-chip memory size. Experimental results demonstrate that the proposed IBOM design increases the L1D hit rate by 29.6 percent on average and in turn improves performance by 3X for cache-sensitive applications.
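To make the set-balancing idea concrete, the toy C++ model below redirects fills from a saturated home set to a less-used partner set, so that addresses that would otherwise thrash a single set spread across two. The cache geometry, the XOR partner mapping, the pressure counter, and the replacement policy are all illustrative assumptions, not the paper's actual IBOM mechanism.

```cpp
// Toy set-associative cache illustrating set balancing: on a miss, the fill
// goes to whichever of the home set or its partner set is under less pressure.
// All parameters and policies here are assumptions made for illustration.
#include <array>
#include <cstdint>
#include <initializer_list>
#include <iostream>

constexpr int kSets = 32;   // assumed number of L1D sets
constexpr int kWays = 4;    // assumed associativity

struct Set {
    std::array<uint64_t, kWays> tags{};
    std::array<bool, kWays> valid{};
    int fills = 0;          // crude per-set pressure counter
};

std::array<Set, kSets> cache;

// Assumed partner mapping: pair each set with one in the other half.
int partner(int set) { return set ^ (kSets / 2); }

bool lookup_or_fill(uint64_t addr) {
    uint64_t line = addr >> 7;                  // assumed 128-byte lines
    int home = static_cast<int>(line % kSets);
    uint64_t tag = line / kSets;

    // Probe the home set first, then its partner.
    for (int s : {home, partner(home)}) {
        for (int w = 0; w < kWays; ++w)
            if (cache[s].valid[w] && cache[s].tags[w] == tag) return true;  // hit
    }
    // Miss: fill into the partner set when the home set is under more pressure.
    int target = (cache[home].fills > cache[partner(home)].fills)
                     ? partner(home) : home;
    Set& s = cache[target];
    int way = s.fills % kWays;                  // trivial replacement policy
    s.tags[way] = tag;
    s.valid[way] = true;
    s.fills++;
    return false;                               // miss
}

int main() {
    // Eight lines that all map to the same home set: a plain 4-way set would
    // thrash, but balancing spreads them over the home and partner sets,
    // so the second pass hits entirely.
    int hits = 0;
    for (int rep = 0; rep < 2; ++rep)
        for (uint64_t i = 0; i < 8; ++i)
            hits += lookup_or_fill(i * kSets * 128);
    std::cout << "hits: " << hits << "\n";      // prints 8
    return 0;
}
```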