Abstract

Over the past few decades, Machine Learning (ML) has gained unprecedented popularity, becoming a pervasive technology that has benefited a broad range of domains such as market analysis, environmental science, medical research, and materials science. Among the many ML algorithms, Deep Learning (DL) has been shown to be effective in terms of accuracy, becoming the primary focus of attention for many leading research and industrial labs. The primary algorithm used in DL is the Deep Neural Network (DNN), a deep-structured, multilayer Artificial Neural Network (ANN). To achieve better accuracy, researchers keep adding more layers and parameters when building a DNN model. As a result, new computing challenges emerge, as a larger DNN model places greater demands on computing and memory resources, especially during the training phase. Given their massive degree of parallelism, GPUs are a good fit for accelerating DNN training. However, DNN execution on a GPU can be highly inefficient, limiting the GPU's potential to accelerate deeper DNN models. We need to develop more efficient hardware acceleration to enable the future scalability of DNNs.

There are three fundamental challenges faced when running high-performance DNNs on a GPU. First, there is a lack of tools available to measure and tune the performance of DNN computations on a GPU. Second, there is limited prior work on characterizing the bottlenecks present in this class of workloads when targeting GPUs. Lastly, existing methods that utilize sparsity and model compression have targeted FPGAs and ASICs, but not GPUs. DNNs running on GPUs exhibit unique execution patterns, and directly applying prior techniques developed for FPGAs and ASICs can lead to significant inefficiencies, potentially degrading performance.

In this dissertation, we develop DNNMark, a GPU benchmark suite that consists of a collection of DNN primitives covering a rich set of GPU computing patterns. The suite is designed as a highly configurable, extensible, and flexible framework in which benchmarks can be run either individually or collectively. Next, we characterize the performance bottlenecks present in Convolutional Neural Network (CNN) models by considering microarchitectural-level bottlenecks on a layer-by-layer basis. We also characterize their memory access behavior in the context of a typical GPU memory hierarchy. Furthermore, we present the design of Spartan, a lightweight hardware/software framework to accelerate DNN training on a GPU. Spartan provides a cost-effective and programmer-transparent microarchitectural solution to exploit the sparsity detected during training. Spartan consists of three components: i) a sparsity monitor that intelligently acquires and tracks activation sparsity with negligible overhead; ii) a tile-based sparse GEMM algorithm that leverages a new sparse representation, namely ELLPACK-DIB; and iii) a novel compaction engine designed specifically for GPU workloads to support dynamic compaction of sparse data into the ELLPACK-DIB format. Finally, we explore the acceleration of DNNs using Block Circulant Matrices (BCMs), a model compression technique. We identify the GPU-specific challenges posed by using BCMs, and then perform both general and GPU-specific optimizations that impact: i) the decomposition and interaction of the individual operations required in the BCM algorithm, and ii) the overall GPU kernel design.
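
To make the BCM-based approach concrete: in BCM compression a weight matrix is partitioned into b x b circulant blocks, each defined by a single length-b vector, so a block-level matrix-vector product reduces to an FFT, an element-wise multiply, and (after accumulating across blocks) a single inverse FFT. The NumPy sketch below illustrates only that decomposition of operations; the function name, block sizes, and random data are illustrative assumptions and are not taken from the dissertation.

```python
# Sketch of FFT-based block-circulant matrix-vector multiplication (NumPy).
# All names, sizes, and data here are illustrative, not the dissertation's code.
import numpy as np

def bcm_matvec(blocks, x, b):
    """Multiply a block-circulant matrix by a vector x.

    blocks[i][j] is the length-b vector defining the circulant block in block-row i,
    block-column j, so a (p*b) x (q*b) matrix is stored with only p*q*b parameters.
    """
    p, q = len(blocks), len(blocks[0])
    x_blocks = x.reshape(q, b)
    y = np.zeros(p * b)
    for i in range(p):
        acc = np.zeros(b, dtype=complex)
        for j in range(q):
            # Circulant block product: W_ij @ x_j == IFFT(FFT(w_ij) * FFT(x_j)).
            # Accumulate in the frequency domain so only one IFFT is needed per block-row.
            acc += np.fft.fft(blocks[i][j]) * np.fft.fft(x_blocks[j])
        y[i * b:(i + 1) * b] = np.fft.ifft(acc).real
    return y

# Tiny usage example with random data.
b, p, q = 4, 2, 3
rng = np.random.default_rng(0)
blocks = [[rng.standard_normal(b) for _ in range(q)] for _ in range(p)]
x = rng.standard_normal(q * b)
y = bcm_matvec(blocks, x, b)  # y has length p * b = 8
```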
