Abstract

This article presents a graphics processing unit (GPU) scheduling scheme that maximizes the exploitation of data locality in deep neural networks (DNNs). Convolution is one of the fundamental operations used in DNNs and accounts for more than 90% of the total execution time. To leverage the massive thread-level parallelism (TLP) of a GPU, deeply nested convolution loops are lowered (or unrolled) into a large matrix multiplication, which trades memory capacity and bandwidth for greater TLP. The large workspace matrix is split into general matrix multiplication (GEMM) tiles that are executed concurrently by many thread blocks. Notably, during the lowering process the workspace is filled with many duplicate entries that originate from the same elements of the input feature map. However, conventional GPU scheduling is oblivious to these duplication patterns, and thread blocks are assigned to streaming multiprocessors (SMs) regardless of the data similarity between GEMM tiles. Such scheduling misses a significant opportunity to exploit the data locality inherent in DNN convolution. This article proposes a GPU scheduling technique called Locality-Aware Scheduling (LAS) that i) identifies, from the lowering pattern of a DNN convolution, which thread blocks share the largest amount of identical data and ii) allocates the thread blocks with the greatest data similarity to the same SM. In this way, the small caches in each SM can efficiently exploit the data locality of DNN convolution. Experimental results show that LAS with tensor cores achieves a 20.1% performance improvement on average, along with a 14.8% increase in L1 cache hit rates.
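To make the two ideas in the abstract concrete, the sketch below (not the paper's implementation; all names, shapes, the single-channel input, and the row-tile size are illustrative assumptions) shows how im2col-style lowering duplicates input elements in the workspace matrix, and how one could measure which GEMM row-tiles share the most source data and would therefore benefit from being co-scheduled on the same SM.

```python
# Minimal illustration of (1) data duplication created by im2col lowering and
# (2) measuring data sharing between GEMM row-tiles of the workspace matrix.
import numpy as np

def im2col(x, k):
    """Lower a single-channel H x W input into a workspace matrix whose rows
    are the flattened k x k windows (stride 1, no padding)."""
    h, w = x.shape
    out_h, out_w = h - k + 1, w - k + 1
    rows = []
    for i in range(out_h):
        for j in range(out_w):
            rows.append(x[i:i + k, j:j + k].ravel())
    return np.array(rows)                  # shape: (out_h * out_w, k * k)

def tile_similarity(workspace, tile, a, b):
    """Count how many identical source elements two row-tiles share.
    Element values are unique here, so they act as source indices."""
    ta = set(workspace[a * tile:(a + 1) * tile].ravel())
    tb = set(workspace[b * tile:(b + 1) * tile].ravel())
    return len(ta & tb)

# Small example: a 6x6 input with unique element values, 3x3 windows.
h = w = 6
k = 3
x = np.arange(h * w).reshape(h, w)
workspace = im2col(x, k)

# Adjacent windows overlap, so most input elements are duplicated many times.
unique, counts = np.unique(workspace, return_counts=True)
print("workspace entries:", workspace.size, "| distinct inputs:", unique.size)
print("max duplication of a single input element:", counts.max())

# For each row-tile of the hypothetical GEMM (4 rows per tile here), find the
# tile it shares the most source elements with; a locality-aware scheduler
# would try to place such tiles on the same SM so the shared data hits in L1.
tile = 4
n_tiles = workspace.shape[0] // tile
for a in range(n_tiles):
    best = max((b for b in range(n_tiles) if b != a),
               key=lambda b: tile_similarity(workspace, tile, a, b))
    print(f"tile {a}: most data shared with tile {best} "
          f"({tile_similarity(workspace, tile, a, best)} common elements)")
```

In an actual GPU the analogous decision is which thread blocks (each computing one GEMM tile) get dispatched to the same SM, so that the duplicated workspace data they both read can be served from that SM's L1 cache rather than fetched repeatedly from lower levels of the memory hierarchy.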
