Abstract

Graphics Processing Units (GPUs) have become the dominant accelerators for Machine Learning (ML) and High-Performance Computing (HPC) applications thanks to their massive parallelism, exploited chiefly through general matrix-matrix multiplication (GEMM) kernels. However, GEMM kernels often suffer from duplicated memory requests, mainly caused by the matrix tiling used to handle large matrices. While GPUs provide programmable shared memory to mitigate this issue by keeping frequently reused data on-chip, GEMM still introduces duplication in the register files. Our observations show that matrix tiling issues memory requests to the same shared memory address from neighboring threads, producing a substantial amount of duplicated data in the register files. Such duplication degrades GPU performance by limiting warp-level parallelism due to register shortage and by issuing redundant memory requests to shared memory. We find that this data duplication falls into two types, each occurring in a fixed pattern during matrix tiling. Based on these observations, we introduce SHREG, an architecture design that lets different threads share registers holding overlapping data from shared memory, effectively reducing duplicated data within the register files. By leveraging the duplication patterns, SHREG enables register sharing and improves performance with minimal hardware overhead. Our evaluation shows that SHREG improves performance by 31.4% on various ML applications over the baseline GPU.
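The duplication pattern the abstract describes can be sketched with a small Python model of one inner-loop step of a classic tiled GEMM. This is an illustrative assumption, not code from the paper: the tile size and the `As`/`Bs` index pattern follow the textbook shared-memory GEMM, where every thread in a block row reads the same element of the A tile and every thread in a block column reads the same element of the B tile, so each distinct shared-memory value ends up replicated across many threads' registers.

```python
# Hypothetical model (not from the paper): count duplicated shared-memory
# reads in one k-step of a textbook tiled GEMM thread block.
TILE = 16  # assumed tile / thread-block dimension

k = 0        # a single step of the inner tile loop
reads = []   # (array, address) pairs issued by all threads at this step
for ty in range(TILE):          # thread row within the block
    for tx in range(TILE):      # thread column within the block
        reads.append(("As", (ty, k)))  # all threads in row ty read As[ty][k]
        reads.append(("Bs", (k, tx)))  # all threads in col tx read Bs[k][tx]

total = len(reads)        # register loads issued by the block
unique = len(set(reads))  # distinct shared-memory addresses behind them
print(total, unique)      # 512 loads, but only 32 distinct values
```

In this model, 512 register loads fetch only 32 distinct shared-memory values, i.e., each value is held in 16 threads' registers; this is the per-row and per-column redundancy that a register-sharing scheme like SHREG targets.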
