Abstract

Sparse matrix linear algebra operations represent one of the most important classes of computational kernels due to their popularity in scientific and engineering applications. These operations are commonly used in data mining, graph analytics and machine learning applications. A naive implementation of these operations can suffer from poor performance due to the nature of the data structures used to represent the matrices. The main cause of these problems is the inherent irregularity of the data accesses, which results in low utilization of memory bandwidth and high cache miss rates. As a result, sparse operations remain one of the main bottlenecks in sparse linear algebra computations.

Given the nature of matrix-based operations, a massive degree of parallelism exists. Therefore, GPUs are a good fit for accelerating these kernels. GPUs achieve high throughput by leveraging thread-level parallelism, efficient thread swapping and high memory bandwidth. The memory system on a GPU differs significantly from what we find on a typical CPU. Instead of the deep (3-4 level) cache hierarchy found on CPUs, GPUs have many L1 caches and only a single, large, multi-banked L2 cache. The L1 caches are equipped with coalescing units, which can detect spatial locality across multiple cores and combine requests to the same cache block, reducing the number of memory transactions. When performing sparse operations targeting a GPU, we need to carefully consider the memory hierarchy of the GPU, avoiding optimizations that only work well on CPUs. We should instead consider optimizations better suited to the high degree of thread-level parallelism present on GPUs.

In this thesis, we explore the various options available in terms of the underlying data structures used for sparse matrices. Our goal is to optimize the execution of sparse matrix computations on GPUs. We first evaluate the memory hierarchy commonly found on GPUs, which provides guidance on how to optimize performance. We next consider existing sparse matrix formats and identify their shortcomings in terms of memory behavior and format conversion overhead. We then present our own format, specifically tailored for GPUs. Our general approach is called Vertical Compressed Sparse Row (VCSR), a format that focuses on achieving both high thread-level parallelism and high memory bandwidth utilization, while also minimizing compaction overhead. VCSR is designed to leverage the GPU's coalescing unit, which can significantly reduce the number of global memory transactions. We demonstrate the efficiency of our compressed matrix format by developing an optimized Sparse Matrix-Vector Multiplication (SpMV) kernel that uses VCSR.

Next, we shift our focus to leveraging the same approach to optimize Sparse Matrix-Matrix Multiplication (SpMM), which has different memory requirements as compared to SpMV. For this kernel, in addition to leveraging VCSR's ability to handle the memory request patterns associated with the input sparse matrices, the reusability of the second matrix (typically a dense matrix) adds a second challenge. We redesign our existing SpMV kernel implementation to achieve high reuse, and adopt the VCSR sparse matrix format to fully optimize this kernel.

Finally, we focus on leveraging VCSR-based formats for sparse operations in numerical solvers. As numerical solvers commonly work with diagonal sparse matrices, we adapt VCSR-Baseline to the DIA format and propose VCSR-DIA. This format improves performance for perfectly diagonal matrices.
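For context, the sketch below shows SpMV over the standard DIA (diagonal) layout, the storage scheme that VCSR-DIA builds on; it is not the thesis's VCSR-DIA kernel, and all names in it are illustrative assumptions.

// Minimal sketch of SpMV over the standard DIA (diagonal) layout, shown only
// to illustrate the storage scheme that VCSR-DIA builds on; this is NOT the
// thesis's VCSR-DIA kernel, and all names are illustrative.
//
// data[d * num_rows + row] holds A[row][row + offsets[d]] (zero-padded when
// that column falls outside the matrix), so threads with consecutive row
// indices read consecutive addresses and their loads coalesce.
__global__ void dia_spmv(int num_rows, int num_cols, int num_diags,
                         const int *offsets, const double *data,
                         const double *x, double *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    double sum = 0.0;
    for (int d = 0; d < num_diags; ++d) {
        int col = row + offsets[d];
        if (col >= 0 && col < num_cols)
            sum += data[d * num_rows + row] * x[col];
    }
    y[row] = sum;
}

Because every stored diagonal is padded to the full row count, short or sparsely populated diagonals are stored mostly as zeros; this padding is what makes DIA-style layouts wasteful for matrices that are only partially diagonal, which motivates the hybrid format discussed next.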
For partially diagonal matrices, however, DIA-based formats are highly memory inefficient. For these matrices, we propose VCSR-HYB, which combines VCSR-DIA and VCSR-Baseline. When converting a matrix from any format to VCSR-HYB, we use a convolutional neural network (CNN) to decide whether to use VCSR-DIA or VCSR-Baseline. We evaluate the performance of both VCSR-DIA and VCSR-HYB and demonstrate the efficiency of these formats.
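To illustrate the role the format-selection step plays, the following host-side sketch chooses between a DIA-style and a CSR-style layout using a simple hand-written heuristic (the fraction of nonzeros that fall on the densest diagonals). The thesis instead makes this decision with a CNN; the heuristic, names, and thresholds here are assumptions for illustration only.

// Hand-written stand-in for the format-selection step: the thesis uses a CNN
// to choose between VCSR-DIA and VCSR-Baseline, whereas this sketch uses a
// simple host-side heuristic (fraction of nonzeros on the densest diagonals).
// All names, thresholds, and the heuristic itself are illustrative assumptions.
#include <algorithm>
#include <map>
#include <vector>

enum class Format { DiaLike, CsrLike };

Format choose_format(int num_rows, const std::vector<int> &row_ptr,
                     const std::vector<int> &col_idx,
                     int max_diags = 32, double threshold = 0.9) {
    // Histogram nonzeros by diagonal offset (col - row).
    std::map<int, long long> per_diag;
    for (int row = 0; row < num_rows; ++row)
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            ++per_diag[col_idx[k] - row];

    // Sum the populations of the max_diags densest diagonals.
    std::vector<long long> counts;
    for (const auto &kv : per_diag) counts.push_back(kv.second);
    std::sort(counts.rbegin(), counts.rend());

    long long covered = 0, total = row_ptr[num_rows];
    for (size_t i = 0; i < counts.size() && i < (size_t)max_diags; ++i)
        covered += counts[i];

    // If a few diagonals hold most of the nonzeros, a DIA-style layout pays
    // off; otherwise fall back to the CSR-style (baseline) layout.
    return (total > 0 && (double)covered / total >= threshold)
               ? Format::DiaLike : Format::CsrLike;
}

In the thesis the decision is made by a CNN rather than a fixed threshold; this sketch only stands in for the role that selector plays in the conversion pipeline.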
