Abstract

The increasing demand for solving larger and more complex problems in computational science and engineering is a major driving factor for deploying computer systems with ever-advancing performance capabilities. To increase the available performance, modern High-Performance Computing (HPC) platforms come with multiple levels of parallelism, complex memory hierarchies, heterogeneous architectures, and extreme scales. Sustainable and efficient software under these premises requires particular attention to inherent challenges such as efficiency at all scales and performance portability across heterogeneous architectures. This work addresses the development of high-performance scientific software for sparse linear algebra, an important field of research that forms the foundation of many applications in computational science and engineering, with a special focus on sparse eigenvalue solvers on current and future supercomputers. One of the most prominent building blocks of sparse linear algebra is Sparse Matrix-Vector Multiplication (SpMV). In this work, a platform-agnostic storage format for high-performance general SpMV is developed: SELL-C-σ. Its design is justified on the basis of existing, device-specific formats, and best practices for the selection of its tuning parameters are provided. It is demonstrated that SpMV using a unified SELL-C-σ matrix not only yields competitive performance but often even surpasses device-specific formats and implementations. Sparse linear algebra algorithms can often be formulated using blocks of vectors instead of single vectors. This technique is in some cases motivated by numerical benefits, but it also offers performance optimization potential. In this work, the resulting shift of hardware bottlenecks is analyzed and highly efficient implementations of block vector kernels are presented.
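The chunked, column-major storage behind SELL-C-σ can be illustrated with a minimal sketch. It assumes a chunk height of C = 2 and no row sorting (σ = 1); the struct layout and all names are illustrative only, not the actual data structures of this work:

```c
/* Minimal SELL-C-sigma SpMV sketch (C = 2, sigma = 1, i.e. no row sorting).
   Rows are grouped into chunks of C rows; each chunk is padded to the length
   of its longest row, and nonzeros are stored column-major *within* a chunk
   so that C rows can be processed in lockstep (SIMD-friendly inner loop).
   All type and field names here are illustrative. */

#define CHUNK 2  /* chunk height C */

typedef struct {
    int nrows;              /* number of rows (padded to a multiple of CHUNK) */
    int nchunks;            /* nrows / CHUNK */
    const int *chunk_len;   /* per-chunk width = longest row in the chunk */
    const int *chunk_off;   /* start offset of each chunk in val[]/col[] */
    const double *val;      /* nonzeros, column-major within each chunk */
    const int *col;         /* column indices, same layout as val[] */
} sell_c_matrix;

void sell_spmv(const sell_c_matrix *A, const double *x, double *y) {
    for (int c = 0; c < A->nchunks; ++c) {
        double tmp[CHUNK] = {0.0};
        int off = A->chunk_off[c];
        for (int j = 0; j < A->chunk_len[c]; ++j)   /* columns of the chunk */
            for (int r = 0; r < CHUNK; ++r)         /* C rows in lockstep   */
                tmp[r] += A->val[off + j*CHUNK + r] * x[A->col[off + j*CHUNK + r]];
        for (int r = 0; r < CHUNK; ++r)
            y[c*CHUNK + r] = tmp[r];
    }
}
```

Padded slots carry a zero value and any valid column index, so they contribute nothing to the result; sorting rows by length within windows of σ rows (omitted here) keeps rows of similar length in the same chunk and thus reduces this padding overhead.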
Kernel fusion is employed in the present work on top of optimized basic building blocks, leading to custom compute kernels for sparse linear algebra algorithms that yield significant performance gains. The performance engineering process applied consistently to all software development efforts in this work – including code analysis, benchmarking, the evaluation of hardware metrics, and code optimization – is closely guided by performance models. Performance modeling is an indispensable ingredient in the development of high-performance software components, as it helps reveal hardware bottlenecks, identify suitable optimization techniques, and assess the efficiency of an existing implementation on a given hardware platform. To make them accessible to a broader community, all software components developed within this work are combined into the scalable open-source software library GHOST. To demonstrate their applicability, full-application performance of several sparse eigenvalue solvers for real-world problems is presented on some of the world’s largest supercomputers, spanning completely different hardware architectures. Combining all of the developed building blocks and techniques, solutions to the standard eigenvalue problem are obtained for relevant quantum physics applications. At the largest scale, matrices with up to 26 billion rows and 7 terabytes of raw data are investigated on thousands of homogeneous and heterogeneous compute nodes, with hundreds of TFLOP/s sustained performance and verifiably high efficiency from the single node to the extreme scale.
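The idea behind kernel fusion can be sketched with a hedged example: fusing a vector update (AXPY) with a dot product of the result saves one full sweep over the vector through the memory hierarchy. The choice of kernels and all names are illustrative, not the specific fused kernels of this work:

```c
/* Kernel fusion sketch: y := y + a*x fused with s := dot(y, y).
   Unfused, this takes two passes over y (one write, one read); fused, each
   y[i] is reused while still in register, halving the memory traffic for y
   in a bandwidth-bound regime. Names are illustrative. */
double fused_axpy_dot(int n, double a, const double *x, double *y) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) {
        y[i] += a * x[i];   /* AXPY update */
        s += y[i] * y[i];   /* dot product on the freshly updated element */
    }
    return s;
}
```

Such fusion opportunities arise naturally in iterative eigenvalue and linear solvers, where a vector produced by one BLAS-1/2 operation is immediately consumed by the next.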
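In their simplest form, the performance models guiding such work are roofline-type bounds: attainable performance is capped by either the peak floating-point rate or by memory bandwidth times the kernel's computational intensity. A sketch of this bound follows; all numbers fed into it are placeholders, not measurements from this work:

```c
/* Roofline-style bound sketch: attainable GFLOP/s is the minimum of the
   machine's peak floating-point rate and the rate at which main-memory
   bandwidth can feed the kernel's computational intensity (flops per byte).
   Inputs are illustrative placeholders. */
double roofline_gflops(double peak_gflops, double mem_bw_gbytes_per_s,
                       double intensity_flops_per_byte) {
    double mem_bound = mem_bw_gbytes_per_s * intensity_flops_per_byte;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}
```

Because SpMV typically has a low computational intensity (on the order of a few tenths of a flop per byte in double precision), the memory term usually dominates, which is why SpMV is memory-bound on most platforms and why data layouts such as SELL-C-σ target memory-access efficiency.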
